ELT Data Engineering Project- Trip Analysis

This is an end-to-end data engineering project built on the Taxi Trips dataset.

Technology Stack

The following technologies are used to build this project:

  • Google Cloud Storage (GCS) - data lake

  • Prefect - orchestration

  • dbt - transformation and data modeling

  • Google BigQuery - data warehouse

  • Google Looker Studio - visualizations

Data Pipeline Architecture

Visual Summary

Dashboard

Reproduce it yourself

  1. First, clone this repo to your local machine:

git clone https://github.com/Khunmi/ELT_Project_DZC

  2. Set up your Google Cloud environment:

export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login

Check out this link for a video walkthrough.
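
To confirm the credentials work before moving on, a quick sanity check is to list the buckets in your project from Python (a minimal sketch; assumes the google-cloud-storage package is installed):

from google.cloud import storage

# Picks up GOOGLE_APPLICATION_CREDENTIALS from the environment
client = storage.Client()
for bucket in client.list_buckets():
    print(bucket.name)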

  3. Set up your orchestration

  • If you do not have a Prefect workspace, sign up for Prefect Cloud and create a workspace here

  • Create the Prefect blocks via the cloud UI, or adjust the variables in /prefect/prefect_blocks.py and run:

python prefect/block/make_gcp_blocks.py
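
For reference, a block-registration script of this kind generally looks like the following (a minimal sketch using the prefect-gcp library; the block names, bucket name, and key path are placeholders to adjust for your setup):

from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Register a credentials block pointing at your service account key
credentials = GcpCredentials(service_account_file="path/to/credentials.json")
credentials.save("trip-gcp-creds", overwrite=True)

# Register a GCS bucket block that flows can later load by name
bucket = GcsBucket(bucket="your-data-lake-bucket", gcp_credentials=credentials)
bucket.save("trip-gcs-bucket", overwrite=True)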
  • To execute the flow, run the following commands in two different terminals:

prefect agent start -q 'default'
python prefect/docs/parameterized_flow.py

  • After the flow ‘parameterized_flow.py’ completes (it takes about 25 minutes), all datasets will have been uploaded to the data lake as Parquet files.
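
For orientation, the flow follows this general shape (a heavily simplified sketch, not the actual parameterized_flow.py; the URL, block name, and file paths are placeholders, and it assumes pandas, pyarrow, and prefect-gcp are installed):

from pathlib import Path

import pandas as pd
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket

@task(retries=3)
def fetch(url: str) -> pd.DataFrame:
    # Download one chunk of trip data
    return pd.read_csv(url)

@task()
def write_local(df: pd.DataFrame, path: Path) -> Path:
    # Store the dataset locally as a compressed Parquet file
    df.to_parquet(path, compression="gzip")
    return path

@task()
def write_gcs(path: Path) -> None:
    # Load the GCS bucket block registered earlier and upload the file
    bucket = GcsBucket.load("trip-gcs-bucket")
    bucket.upload_from_path(from_path=path, to_path=path.name)

@flow()
def etl_web_to_gcs(url: str) -> None:
    df = fetch(url)
    path = write_local(df, Path("trip_data.parquet"))
    write_gcs(path)

if __name__ == "__main__":
    etl_web_to_gcs("https://example.com/trip_data.csv")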

  4. Data transformation and modeling using dbt

  • I decided to create a free dbt Cloud developer account and set up my Google credentials with it to enable communication between Google BigQuery and dbt

  • Kindly refer to the core and staging models I created for more dbt transformation documentation and syntax. To build all models against the full dataset, run:

dbt build --vars 'is_test_run: false'

The is_test_run variable typically guards a development row limit in the staging model, so setting it to false builds against the full dataset. You will get 4 tables in the Citibike_data_dbt dataset:

  • dim_neighbourhoods

  • Citibike_stations_lookup

  • facts_citibike

  • stg_citibike_data
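
Once the build finishes, you can sanity-check the output from Python before wiring up a dashboard (a minimal sketch using the google-cloud-bigquery client; replace the project and dataset names with your own):

from google.cloud import bigquery

client = bigquery.Client()

# Count the rows in the facts table produced by dbt
query = """
    SELECT COUNT(*) AS n_trips
    FROM `your-project.Citibike_data_dbt.facts_citibike`
"""
for row in client.query(query):
    print(f"facts_citibike rows: {row.n_trips}")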

Data Visualization and Dashboarding

  • You can now query the data and connect it to Looker to visualize it. When connecting the data source, use the facts table to build the data source in Looker; don't use the partitioned table, as you won't get any data in your report.

  • Go to Looker Studio → Create → BigQuery → choose your project, dataset, and transformed table.

Future work

  • Add more tests

  • Add CI/CD pipeline

  • Build an incremental model to append only the latest data from each load

  • Add further transformations, documentation, and data quality tests in dbt

  • Streaming process with Kafka and Spark