US Airline Market Analysis
Description
- The objective of the project is to analyse 20 years of US flight data (~128 million rows) and find patterns that help gauge different travel choices.
- Identify trends and patterns over that period.
- Use Spark DataFrames and Spark SQL to handle the large dataset efficiently, with Amazon S3 as the data lake.
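
As a rough illustration of the DataFrame plus Spark SQL approach described above, the sketch below reads Parquet files from S3, registers them as a temporary view, and runs an aggregate query. The bucket path, the view name `flights`, the column names `Year` and `ArrDelay`, and both function names are assumptions made for illustration; the project's actual schema and paths may differ.

```python
def yearly_delay_sql(view="flights", year_col="Year", delay_col="ArrDelay"):
    """Build a Spark SQL query: flight count and mean arrival delay per year.

    The view and column names are hypothetical defaults, not the real schema.
    """
    return (
        f"SELECT {year_col} AS year, "
        f"COUNT(*) AS flights, "
        f"AVG({delay_col}) AS avg_delay "
        f"FROM {view} "
        f"GROUP BY {year_col} "
        f"ORDER BY {year_col}"
    )


def yearly_delays(spark, path, view="flights"):
    """Read Parquet data from `path`, expose it as a temp view, and query it.

    `spark` is an existing SparkSession (Databricks notebooks provide one).
    """
    df = spark.read.parquet(path)        # columnar read from the data lake
    df.createOrReplaceTempView(view)     # make the DataFrame visible to SQL
    return spark.sql(yearly_delay_sql(view))


# Example call in a notebook (hypothetical S3 location):
# yearly_delays(spark, "s3a://example-bucket/flights/").show()
```

Keeping the query text in its own function makes it easy to inspect or reuse the SQL without touching the cluster.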
Technology Used
- Big Data Analysis using Spark
- Amazon Web Services
Environment
Python (PySpark), Spark SQL, Databricks, Amazon S3, Apache Parquet
Analysis: Notebook
Architecture
Analysis 1: Exploratory Data Analysis
Analysis 2: Impact of Global Recession
Analysis 3: Fraud Data Analysis
Performance Optimization
- Repartitioning the DataFrame with an optimized number of partitions and tuning spark.sql.shuffle.partitions to speed up shuffles.
- Pulling data sets into a cluster-wide in-memory cache.
- Using Apache Parquet, a columnar storage format that supports flexible compression options and provides efficient encoding schemes.
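
The three optimizations above might look roughly like the following in PySpark. This is a sketch under stated assumptions, not the project's actual code: the 128 MB target partition size, the helper names `target_partitions`, `optimize`, and `write_parquet`, and the snappy codec are all illustrative choices.

```python
import math


def target_partitions(total_bytes, target_bytes=128 * 1024 * 1024, min_parts=1):
    """Estimate a partition count so each partition holds about target_bytes.

    The 128 MB default is an assumed rule of thumb, not a measured optimum.
    """
    return max(min_parts, math.ceil(total_bytes / target_bytes))


def optimize(spark, df, total_bytes):
    """Repartition, align shuffle parallelism, and cache the DataFrame."""
    n = target_partitions(total_bytes)
    # Spark's default for this setting is 200; align it with the data size.
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    df = df.repartition(n)
    df.cache()    # lazy: only marks the DataFrame for in-memory caching
    df.count()    # an action, so the cache is actually populated
    return df


def write_parquet(df, path, compression="snappy"):
    """Persist the DataFrame as compressed Parquet (codec is an assumption)."""
    df.write.mode("overwrite").parquet(path, compression=compression)
```

For example, `target_partitions(10 * 1024**3)` suggests 80 partitions for roughly 10 GB of data. The right number in practice depends on cluster cores and data skew, so it is worth checking against the Spark UI.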