# ELT Data Engineering Project - Trip Analysis

This is an end-to-end ELT data engineering project built on a Citibike trips dataset.
## Technology Stack

The following technologies are used to build this project:

- Google Cloud Storage (GCS) - data lake
- Prefect - orchestration
- dbt - transformation and data modeling
- Google BigQuery - data warehouse
- Google Looker Studio - visualizations
## Data Pipeline Architecture

## Dashboard

## Reproduce it yourself
First, clone this repo to your local machine:

```bash
git clone https://github.com/Khunmi/ELT_Project_DZC
```
### Set up your Google Cloud environment

1. Create a Google Cloud Platform project.
2. Configure Identity and Access Management (IAM) for a service account, granting it the following roles:
   - Viewer
   - Storage Admin
   - Storage Object Admin
   - BigQuery Admin
3. Download the service account's JSON credentials and save them, e.g. to `~/.gc/<credentials>`.
4. Install the Google Cloud SDK.
5. Point the environment variable at your GCP key, authenticate it, and refresh the session token:

```bash
export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login
```
Check out this link for a video walkthrough.
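Before moving on, you can optionally confirm that the credentials resolve. A minimal sanity check, assuming the `google-cloud-storage` package is installed (this script is not part of the repo):

```python
# Optional sanity check that the service-account credentials work.
# Not part of the repo; assumes `pip install google-cloud-storage`.
from google.cloud import storage

# The client picks up GOOGLE_APPLICATION_CREDENTIALS automatically.
client = storage.Client()
print([bucket.name for bucket in client.list_buckets()])
```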
### Set up your orchestration

If you do not have a Prefect workspace, sign up for Prefect Cloud and create a workspace here.

Create the Prefect blocks via the cloud UI, or adjust the variables in `/prefect/prefect_blocks.py` and run:

```bash
python prefect/block/make_gcp_blocks.py
```
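For orientation, a block-creation script of this kind generally looks like the sketch below; the block names (`gcp-creds`, `gcs-bucket`) and bucket name are placeholders, not necessarily the names the repo's flows expect:

```python
# Minimal sketch of creating Prefect GCP blocks (Prefect 2.x + prefect-gcp).
# Block names and bucket name are placeholders; match them to the repo's flows.
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Credentials block pointing at the JSON key saved earlier
credentials = GcpCredentials(service_account_file="~/.gc/<credentials>.json")
credentials.save("gcp-creds", overwrite=True)

# Bucket block the flows use to write to the data lake
bucket = GcsBucket(bucket="<your-data-lake-bucket>", gcp_credentials=credentials)
bucket.save("gcs-bucket", overwrite=True)
```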
To execute the flow, run the following commands in two different terminals:

```bash
# Terminal 1: start an agent on the default work queue
prefect agent start -q 'default'

# Terminal 2: run the flow
python prefect/docs/parameterized_flow.py
```

The flow `parameterized_flow.py` takes about 25 minutes to complete; once it finishes, all datasets will have been uploaded to the data lake as Parquet files.
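The real implementation is in `parameterized_flow.py`; as a rough sketch of the web-to-GCS pattern it follows (the URL template, parameters, and block name below are illustrative placeholders, not the repo's actual values):

```python
# Rough sketch of a parameterized web-to-GCS flow in Prefect 2.x.
# URL template, parameters, and block name are placeholders; see
# parameterized_flow.py in the repo for the real implementation.
from pathlib import Path

import pandas as pd
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket


@task(retries=3)
def fetch(url: str) -> pd.DataFrame:
    """Download one month of trip data."""
    return pd.read_csv(url)


@task()
def write_local(df: pd.DataFrame, path: Path) -> Path:
    """Write the dataset locally as a Parquet file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, compression="gzip")
    return path


@task()
def write_gcs(path: Path) -> None:
    """Upload the Parquet file to the GCS data lake."""
    GcsBucket.load("gcs-bucket").upload_from_path(from_path=path, to_path=path)


@flow()
def etl_web_to_gcs(year: int, month: int) -> None:
    url = f"https://<dataset-host>/trips_{year}-{month:02}.csv.gz"  # placeholder
    df = fetch(url)
    path = write_local(df, Path(f"data/trips_{year}-{month:02}.parquet"))
    write_gcs(path)


if __name__ == "__main__":
    etl_web_to_gcs(year=2021, month=1)
```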
### Data transformation and modeling using dbt

I created a free dbt Cloud developer account and set up my Google credentials with it to enable communication between Google BigQuery and dbt. Kindly refer to the core and staging models I created for more dbt transformation docs and syntax.
Build the models with:

```bash
dbt build --vars 'is_test_run: false'
```

Passing `is_test_run: false` follows the common dbt pattern of disabling the row limit that staging models apply during development runs, so the full dataset is built.
You will get four tables in the `Citibike_data_dbt` dataset:

- `dim_neighbourhoods`
- `Citibike_stations_lookup`
- `facts_citibike`
- `stg_citibike_data`
### Data visualization and dashboarding

You can now query the data and connect it to Looker Studio to visualize it. When building the data source in Looker, use the facts table rather than the partitioned table; with the partitioned table you will not get any data in your report.

Go to Looker Studio → Create → BigQuery → choose your project, dataset, and transformed table.
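To sanity-check the facts table before building the Looker data source, a short query like the sketch below works (assumes the `google-cloud-bigquery` package is installed; substitute your own project ID):

```python
# Quick check that the dbt output table is queryable before connecting Looker.
# Assumes `pip install google-cloud-bigquery`; replace <your-project> below.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT COUNT(*) AS row_count
    FROM `<your-project>.Citibike_data_dbt.facts_citibike`
"""
for row in client.query(query).result():
    print(f"facts_citibike rows: {row.row_count}")
```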
## Future work

- Add more tests
- Add a CI/CD pipeline
- Make the model incremental, so each load appends only the latest data
- Add transformations, documentation, and data quality tests in dbt
- Add a streaming process with Kafka and Spark