ELT Data Engineering Project- Trip Analysis

This is an end-to-end data engineering project built on the Taxi Trips dataset.

Technology Stack

The following technologies are used to build this project:

  • Google Cloud Storage (GCS) - data lake

  • Prefect - orchestration

  • dbt - transformation and data modeling

  • Google BigQuery - data warehouse

  • Google Looker Studio - visualizations

Data Pipeline Architecture

Visual Summary

Dashboard

Reproduce it yourself

  1. First, clone this repo to your local machine:

git clone https://github.com/Khunmi/ELT_Project_DZC

  2. Set up your Google Cloud environment:

export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login

Check out this link for a video walkthrough.
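
To confirm the credentials work before moving on, a quick sanity check is to list the buckets in your project from Python (a minimal sketch; assumes the google-cloud-storage package is installed):

from google.cloud import storage

# Picks up GOOGLE_APPLICATION_CREDENTIALS from the environment
client = storage.Client()
for bucket in client.list_buckets():
    print(bucket.name)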

  3. Set up your orchestration

  • If you do not have a Prefect workspace, sign up for Prefect Cloud and create a workspace here

  • Create the Prefect blocks via the cloud UI, or adjust the variables in /prefect/prefect_blocks.py and run:

python prefect/block/make_gcp_blocks.py
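
For reference, a block-registration script of this kind generally looks like the following (a minimal sketch using the prefect-gcp library; the block names, bucket name, and key path are placeholders to adjust for your setup):

from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Register a credentials block pointing at your service account key
credentials = GcpCredentials(service_account_file="path/to/credentials.json")
credentials.save("trip-gcp-creds", overwrite=True)

# Register a GCS bucket block that flows can later load by name
bucket = GcsBucket(bucket="your-data-lake-bucket", gcp_credentials=credentials)
bucket.save("trip-gcs-bucket", overwrite=True)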
  • To execute the flow, run the following commands in two different terminals:

prefect agent start -q 'default'
python prefect/docs/parameterized_flow.py

  • After the flow ‘parameterized_flow.py’ completes (it takes about 25 minutes), all datasets will have been uploaded to the data lake as Parquet files.
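
For orientation, the flow follows this general shape (a heavily simplified sketch, not the actual parameterized_flow.py; the URL, block name, and file paths are placeholders, and it assumes pandas, pyarrow, and prefect-gcp are installed):

from pathlib import Path

import pandas as pd
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket

@task(retries=3)
def fetch(url: str) -> pd.DataFrame:
    # Download one chunk of trip data
    return pd.read_csv(url)

@task()
def write_local(df: pd.DataFrame, path: Path) -> Path:
    # Store the dataset locally as a compressed Parquet file
    df.to_parquet(path, compression="gzip")
    return path

@task()
def write_gcs(path: Path) -> None:
    # Load the GCS bucket block registered earlier and upload the file
    bucket = GcsBucket.load("trip-gcs-bucket")
    bucket.upload_from_path(from_path=path, to_path=path.name)

@flow()
def etl_web_to_gcs(url: str) -> None:
    df = fetch(url)
    path = write_local(df, Path("trip_data.parquet"))
    write_gcs(path)

if __name__ == "__main__":
    etl_web_to_gcs("https://example.com/trip_data.csv")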

  4. Data transformation and modeling using dbt

  • I decided to create a free dbt Cloud developer account and set up my Google credentials with it to enable communication between Google BigQuery and dbt

  • Kindly refer to the core and staging models I created for more dbt transformation documentation and syntax. To build all models against the full dataset, run:

dbt build --vars 'is_test_run: false'

The is_test_run variable typically guards a development row limit in the staging model, so setting it to false builds against the full dataset. You will get 4 tables in the Citibike_data_dbt dataset:

  • dim_neighbourhoods

  • Citibike_stations_lookup

  • facts_citibike

  • stg_citibike_data
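
Once the build finishes, you can sanity-check the output from Python before wiring up a dashboard (a minimal sketch using the google-cloud-bigquery client; replace the project and dataset names with your own):

from google.cloud import bigquery

client = bigquery.Client()

# Count the rows in the facts table produced by dbt
query = """
    SELECT COUNT(*) AS n_trips
    FROM `your-project.Citibike_data_dbt.facts_citibike`
"""
for row in client.query(query):
    print(f"facts_citibike rows: {row.n_trips}")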

Data Visualization and Dashboarding

  • You can now query the data and connect it to Looker to visualize it. When connecting the data source, use the facts table to build the data source in Looker; don't use the partitioned table, as you won't get any data in your report.

  • Go to Looker Studio → Create → BigQuery → choose your project, dataset, and transformed table.

Future work

  • Add more tests

  • Add CI/CD pipeline

  • Build an incremental model to append only the latest data from each load

  • Add further transformations, documentation, and data quality tests in dbt

  • Streaming process with Kafka and Spark