Seamless Data Migration: Transferring Data from Excel to PostgreSQL with Airflow and Pyspark

Terms of use

Terms of Use

The use of this site and the content contained therein is governed by the Terms of Use. When you use this site you acknowledge that you have read the Terms of Use and that you accept and will be bound by the terms hereof and such terms as may be modified from time to time.

All text, graphics, audio, design and other works on the site are the copyrighted works of nasscom unless otherwise indicated. All rights reserved.
Content on the site is for personal use only and may be downloaded provided the material is kept intact and there is no violation of the copyrights, trademarks, and other proprietary rights. Any alteration of the material or use of the material contained in the site for any other purpose is a violation of the copyright of nasscom and / or its affiliates or associates or of its third-party information providers. This material cannot be copied, reproduced, republished, uploaded, posted, transmitted or distributed in any way for non-personal use without obtaining the prior permission from nasscom.
The nasscom Members login is for the reference of only registered nasscom Member Companies.
nasscom reserves the right to modify the terms of use of any service without any liability. nasscom reserves the right to take all measures necessary to prevent access to any service or termination of service if the terms of use are not complied with or are contravened or there is any violation of copyright, trademark or other proprietary right.
From time to time nasscom may supplement these terms of use with additional terms pertaining to specific content (additional terms). Such additional terms are hereby incorporated by reference into these Terms of Use.

Disclaimer

The Company information provided on the nasscom web site is as per data collected by companies. nasscom is not liable on the authenticity of such data.
nasscom has exercised due diligence in checking the correctness and authenticity of the information contained in the site, but nasscom or any of its affiliates or associates or employees shall not be in any way responsible for any loss or damage that may arise to any person from any inadvertent error in the information contained in this site. The information from or through this site is provided "as is" and all warranties express or implied of any kind, regarding any matter pertaining to any service or channel, including without limitation the implied warranties of merchantability, fitness for a particular purpose, and non-infringement are disclaimed. nasscom and its affiliates and associates shall not be liable, at any time, for any failure of performance, error, omission, interruption, deletion, defect, delay in operation or transmission, computer virus, communications line failure, theft or destruction or unauthorised access to, alteration of, or use of information contained on the site. No representations, warranties or guarantees whatsoever are made as to the accuracy, adequacy, reliability, completeness, suitability or applicability of the information to a particular situation.
nasscom or its affiliates or associates or its employees do not provide any judgments or warranty in respect of the authenticity or correctness of the content of other services or sites to which links are provided. A link to another service or site is not an endorsement of any products or services on such site or the site.
The content provided is for information purposes alone and does not substitute for specific advice whether investment, legal, taxation or otherwise. nasscom disclaims all liability for damages caused by use of content on the site.
All responsibility and liability for any damages caused by downloading of any data is disclaimed.
nasscom reserves the right to modify, suspend / cancel, or discontinue any or all sections, or service at any time without notice.

For any grievances under the Information Technology Act 2000, please get in touch with Grievance Officer, Mr. Anirban Mandal at data-query@nasscom.in.

New

See all

No notification found.

Seamless Data Migration: Transferring Data from Excel to PostgreSQL with Airflow and Pyspark

NeST Digital

@nestdigital

February 16, 2024

Big Data Analytics

Apache Airflow serves as an open-source platform designed to develop, schedule, and monitor batch-oriented workflows. Leveraging its extensible Python framework, Airflow empowers users to construct workflows that seamlessly integrate with a wide array of technologies. This tutorial guides you through the process of crafting Apache Airflow DAGs to execute Apache Spark jobs.

The DAG comprises two tasks:

Task 1: the initial task reads an Excel file from the local environment, transforms it into a CSV file, and saves it to a specified output location in the local folder.

Task 2: The subsequent task reads the CSV file, applies additional transformations, and stores the results in a PostgreSQL table.

Required Steps for the Execution

1. Preparing an environment for Airflow in WSL

2. Prerequisites — WSL should be installed with PostgreSQL and PySpark

Preparing the environment for Airflow

For the current scenario, installing Airflow on WSL which involves the following steps:

Step 1: Make a new directory for Airflow, Update Package Lists and Install dependencies for Airflow.mkdir airflow
sudo apt update
sudo apt install -y python3 python3-pip python3-venv

Step 2: Create a Virtual Environment for Airflowpython3 -m venv airflow-venv
source airflow-venv/bin/activate

Step 3: Install Apache Airflow using ‘pip’pip install apache-airflow

By default, Airflow uses SQLite, which is intended for development purposes only. So for production scenarios, you should consider setting up a database backend to PostgreSQL or MySQL. For the current scenario, we are deploying PostgreSQL as a metadata database. The essential configuration modifications in the /airflow.cfg file within the /airflow directory, created during Airflow installation, governs settings like config files, log directories, and dags_folder where Airflow builds its DAGs.

Open the airflow.cfg by your preferred editor like vim.

The first thing you have to change is the executor which is a very important variable in airflow.cfg that determines the level of parallelization of running tasks or DAGs. This option accepts the following values:

· SequentialExecutor – which operates locally, handling tasks one at a time.

· LocalExecutor class executes tasks locally in parallel, utilizing the multiprocessing Python library and a queuing technique.

· CeleryExecutor, DaskExecutor, and KubernetesExecutor – For distributed task execution and improved availability these classes are available.

LocalExecutor is the preferred choice in this scenario, as parallel task execution is essential, and high availability is not a priority at this stage.executor = LocalExecutor

2. sql_alchemy_conn — This crucial configuration parameter in airflow.cfg defines the database type for Airflow’s metadata interactions. In this case, PostgreSQL is selected, and the variable is set as follows:sql_alchemy_conn = postgresql+psycopg2://airflow_user:pass@192.168.10.10:5432/airflow_db

Environment setup for PySpark and PostgreSQL for running Airflow

There are some basic setup needed for PostgreSQL and PySpark which includes,

PostgreSQL Environment Setup

You have to create the required user, roles, databases, schema, and tables in PostgreSQL for storing the metadata and the final output from Spark. Go to the Airflow venv and run,sudo apt upgrade
sudo apt install PostgreSQL
sudo service postgresql start

#to get superuser privileges in Postgres
sudo su – postgres

#create a superuser
CREATE USER [user] WITH PASSWORD ‘password’ SUPERUSER;

#command to interact with the PostgreSQL
psql

#for listing the users
\du

#for listing the databases
\l

Now create new DB’s for storing the metadata & spark outputCREATE DATABASE airflow_db;
CREATE DATABASE employee_details;

Now connect to the “etl_db” database and Create a schema and table in the “employee_details” table\c etl_db;

CREATE SCHEMA etl_db;

CREATE TABLE etl_db.employee_details (
Name VARCHAR(255),
Age INTEGER,
City VARCHAR(255),
Salary INTEGER,
Date_of_Birth DATE
);

Now run the below command,airflow standalone

This will initialize the database, create a user, and start all components.

To access the Airflow UI: Visit localhost:8080 in your browser and log in with the admin account details shown in the terminal.

Take another terminal go to airflow venv and run the commands for the required libraries for connecting Airflow with PostgreSQL.pip install psycopg2-binary
pip install apache-airflow-providers-postgres

Spark Environment Setup

go to the Airflow venv and install the Provider Package for Sparkpip install apache-airflow-providers-apache-spark

Install PostgreSQL-42.2.26.jar in airflow venvwget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.26/postgresql-42.2.26.jarsudo chmod 777 postgresql-42.2.26.jar#to be used in the code to read the excel files
pip install openpyxl

Now start the Spark Cluster and set the local IP and master IP for Spark by running the below commands,export SPARK_LOCAL_IP=127.0.1.1
export SPARK_MASTER_IP=127.0.1.1

Set Spark app home variable in the Airflow UI — this is very useful to define a global variable in Airflow to be used in any DAG. Useful for getting this variable in code

· Go to Airflow UI’s Admin tab and then Variables,

Set up a new connection for Spark in Airflow UI,

· Go to the Admin tab and then “Connections” and add a connection

How to Set Up the DAG Script

The DAG script orchestrates a data pipeline that involves the ingestion of Excel data, transformation using a PySpark application, and loading the transformed data into PostgreSQL.

· DAG Configuration:

The DAG is named ‘excel_spark_airflow_dag’ and is configured with default arguments such as the owner, start date, and email notifications on failure or retry.

· Excel Data Ingestion (PythonOperator):

The ‘excel_data_ingestion task’ is a PythonOperator that executes the transformXL function from the excel_ingestor module. This function reads an Excel file (input_data.xlsx), transforms the data, and saves it as a CSV file (output.csv). The task is set to run on a schedule of every hour (schedule_interval=”0 * * * *”).

· PySpark Application Submission (SparkSubmitOperator):

The load_spark_psql task uses the SparkSubmitOperator to submit a PySpark application (load_spark_psql.py). This application likely performs further transformations or processing on the data and loads it into a PostgreSQL database.

It is configured with parameters such as the Spark connection (spark_standalone_conn), executor cores, memory, driver memory, and packages required for PostgreSQL connectivity.

· Task Dependencies:

The excel_data_ingestion task is set to precede the load_spark_psql task (excel_data_ingestion >> load_spark_psql), indicating that the PySpark application should run after the Excel data has been ingested and transformed.

· Variable Usage:

The PYSPARK_APP_HOME variable is used to define file paths, and it is retrieved using Variable.get(“PYSPARK_APP_HOME”). Ensure this variable is defined in Airflow with the correct path.

· Timezone Configuration:

The DAG specifies the timezone as ‘Asia/Kolkata’ for scheduling purposes.

· Retry and Email Notifications:

The DAG is configured to retry on failure (retries: 0) and send email notifications on failure or retry.

Save the below code as excel_spark_airflow_dag.py in the DAG folder location

https://medium.com/media/39987c99048001e78a5adf09693fa12a

Now create the Python scripts “excel_ingestor.py” which contains “tranformXL” function and “load_spark_plsql.py” which contains “transformDB” function. Place these files inside the “pyspark_app_home” variable directory, and using this variable in the code we can access those files.

excel_ingestor.py

https://medium.com/media/cb74e70bc2bd703b1ffa3348f409cf58

load_spark_psql.py

https://medium.com/media/001c43f466c52c0f3eff3fc32b8e166d

How to Test the Workflow

In Airflow UI’s DAG tab, go to excel_spark_airflow_dag and UnPause the toggle button to start the workflow.

DAG Execution Errors and Resolution

https://medium.com/media/2ce32f4411f56eb94d97eab349eb7144

During the execution of the second task, errors were encountered, and these issues can be attributed to the following reasons:

Issue: Spark Cluster Configuration

It appears that the SPARK_LOCAL_IP and SPARK_MASTER_IP environment variables need to be set in the Airflow environment. In our current setup, Airflow attempts to execute the Spark code with the Spark cluster running locally.

Resolution Steps:

1. Navigate to the Spark sbin directory and Set the SPARK_LOCAL_IP and SPARK_MASTER_IPcd /opt/spark/sbin
export SPARK_LOCAL_IP=127.0.1.1
export SPARK_MASTER_IP=127.0.1.1

2. Start the Spark cluster

3. Rerun the DAG.

As you can see both the tasks ran successfully and there you have it — your ETL data pipeline in Airflow. We can see the output written in the PostgreSQL table,

References

Installation of Airflow™ – Airflow Documentation

This page describes installation options that you might use when considering how to install Airflow™. Airflow consists…

airflow.apache.org

Tutorials – Airflow Documentation

airflow.apache.org

Apache Airflow PySpark data engineering PostgreSQL Data Science Workflow

Disclaimer

That the contents of third-party articles/blogs published here on the website, and the interpretation of all information in the article/blogs such as data, maps, numbers, opinions etc. displayed in the article/blogs and views or the opinions expressed within the content are solely of the author's; and do not reflect the opinions and beliefs of NASSCOM or its affiliates in any manner. NASSCOM does not take any liability w.r.t. content in any manner and will not be liable in any manner whatsoever for any kind of liability arising out of any act, error or omission. The contents of third-party article/blogs published, are provided solely as convenience; and the presence of these articles/blogs should not, under any circumstances, be considered as an endorsement of the contents by NASSCOM in any manner; and if you chose to access these articles/blogs , you do so at your own risk.

Download Attachment

NeST Digital

NeST Digital, the software arm of the NeST Group, has been transforming businesses, providing customized and innovative software solutions and services for customers across the globe. A leader in providing end-to-end solutions under one roof, covering contract manufacturing and product engineering services, NeST has 25 years of proven experience in delivering industry-specific engineering and technology solutions for customers, ranging from SMBs to Fortune 500 enterprises, focusing on Transportation, Aerospace, Defense, Healthcare, Power, Industrial, GIS, and BFSI domains.

30 Jan 2025

How Data Analytics Transforms Revenue Cycle Management in Healthcare

Kaneshwari Pa..

@Kaneshwari Patil

20 Jul 2023

Big Data Analytics

Highlights: Data analytics is revolutionizing healthcare revenue cycle management (RCM), providing invaluable insights and optimizing revenue generation. Inefficient RCM processes can lead to 5% to 10% revenue loss, but data analytics helps reduce…

Role of Analytics Solutions in the BFSI Sector

Tanya Gupta

@tanyagupta

18 Jul 2023

BFSI Analytics Data Science & AI Community Big Data Analytics

In today's digital age, the BFSI (Banking, Financial Services, and Insurance) sector faces numerous challenges and opportunities. To stay competitive and meet the ever-changing demands of customers, businesses in this sector are turning to advanced…

Demystifying Data Mesh - Modern way of implementing domain driven, decentralized data products

NeST Digital

@nestdigital

17 Jul 2023

Big Data Analytics

The term bottleneck is very popular in the software engineering context and most of us are well aware of what it is. Simply put, it refers to the limitation of a single component in a system of multiple components, which restricts the performance…

Data Lake Implementation: A Comprehensive Guide

Tanya Gupta

@tanyagupta

13 Jul 2023

Mobile & Web Development

In today's data-driven world, organizations are constantly seeking efficient and scalable solutions to manage and derive insights from vast amounts of data. One such solution that has gained significant popularity is the data lake. In this blog post…

Recession-proofing: 5 reasons Why Banks are Embracing the Cloud Amidst Economic Slowdown

Kaneshwari Pa..

@Kaneshwari Patil

30 Jun 2023

Cloud Computing Big Data Analytics

Picture this: you’re navigating through stormy economic waters, trying to keep your financial ship afloat amidst the turbulence of a recession. Every decision you make is crucial, and the last thing you need is a costly investment that could sink…

A beginner’s guide to predictive analytics

Kellton

@Kellton

30 Jun 2023

Data Science & AI Community Big Data Analytics

Allied Market Research expects the global predictive analytics market to grow to S$35.45 billion by 2027 at a compound annual growth rate (CAGR) of 21.9%. The global market today is highly unpredictable, thanks to ever-changing customer…

New

Seamless Data Migration: Transferring Data from Excel to PostgreSQL with Airflow and Pyspark

NeST Digital

Installation of Airflow™ – Airflow Documentation

This page describes installation options that you might use when considering how to install Airflow™. Airflow consists…

Tutorials – Airflow Documentation

airflow.apache.org

NeST Digital

How Data Analytics Transforms Revenue Cycle Management in Healthcare

Kaneshwari Pa..

Role of Analytics Solutions in the BFSI Sector

Tanya Gupta

Demystifying Data Mesh - Modern way of implementing domain driven, decentralized data products

NeST Digital

Data Lake Implementation: A Comprehensive Guide

Tanya Gupta

Recession-proofing: 5 reasons Why Banks are Embracing the Cloud Amidst Economic Slowdown

Kaneshwari Pa..

A beginner’s guide to predictive analytics

Kellton

About Us

Knowledge Center

In the News

Topics In Demand

Notification

New

Seamless Data Migration: Transferring Data from Excel to PostgreSQL with Airflow and Pyspark

Share this blog

Related blogs

Harish Kumar Ajjan

25 Feb 2025

Xoriant

25 Feb 2025

Harish Kumar Ajjan

24 Feb 2025

Harish Kumar Ajjan

21 Feb 2025

NuSummit

20 Feb 2025

Harish Kumar Ajjan

19 Feb 2025

Harish Kumar Ajjan

13 Feb 2025

Harish Kumar Ajjan

12 Feb 2025

Systems Plus

05 Feb 2025

Harish Kumar Ajjan

04 Feb 2025

Harish Kumar Ajjan

03 Feb 2025

Harish Kumar Ajjan

30 Jan 2025

About Us

Knowledge Center

In the News

Newsletter