A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Overview

Streamify

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Description

Objective

The project will stream events generated from a fake music streaming service (like Spotify) and create a data pipeline that consumes the real-time data. The data coming in would be similar to an event of a user listening to a song, navigating on the website, authenticating. The data would be processed in real-time and stored to the data lake periodically (every two minutes). The hourly batch job will then consume this data, apply transformations, and create the desired tables for our dashboard to generate analytics. We will try to analyze metrics like popular songs, active users, user demographics etc.

Dataset

Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.

Eventsim uses song data from Million Songs Dataset to generate events. I have used a subset of 10000 songs.

Tools & Technologies

Architecture

streamify-architecture

Final Result

dashboard

Setup

WARNING: You will be charged for all the infra setup. You can avail 300$ in credit by creating a new account on GCP.

Pre-requisites

If you already have a Google Cloud account and a working terraform setup, you can skip the pre-requisite steps.

Get Going!

A video walkthrough of how I run my project - YouTube Video

  • Procure infra on GCP with Terraform - Setup
  • (Extra) SSH into your VMs, Forward Ports - Setup
  • Setup Kafka Compute Instance and start sending messages from Eventsim - Setup
  • Setup Spark Cluster for stream processing - Setup
  • Setup Airflow on Compute Instance to trigger the hourly data pipeline - Setup

Debug

If you run into issues, see if you find something in this debug guide.

How can I make this better?!

A lot can still be done :).

  • Choose managed Infra
    • Cloud Composer for Airflow
    • Confluent Cloud for Kafka
  • Create your own VPC network
  • Build dimensions and facts incrementally instead of full refresh
  • Write data quality tests
  • Create dimensional models for additional business processes
  • Include CI/CD
  • Add more visualizations

Special Mentions

I'd like to thank the DataTalks.Club for offering this Data Engineering course for completely free. All the things I learnt there, enabled me to come up with this project. If you want to upskill on Data Engineering technologies, please check out the course. :)

Owner
Ankur Chavda
Data Engineer-ish
Ankur Chavda
A modern Python build backend

trampolim A modern Python build backend. Features Task system, allowing to run arbitrary Python code during the build process (Planned) Easy to use CL

Filipe Laíns 39 Nov 08, 2022
For radiometrically calibrating and PSF deconvolving IRIS data

irispreppy For radiometrically calibrating and PSF deconvolving IRIS data. I dislike how I need to own proprietary software (IDL) just to simply prepa

Aaron W. Peat 4 Nov 01, 2022
GWAS summary statistics files QC tool

SSrehab dependencies: python 3.8+ a GNU/Linux with bash v4 or 5. python packages in requirements.txt bcftools (only for prepare_dbSNPs) gz-sort (only

21 Nov 02, 2022
A simple streamlit webapp with multiple functionality

A simple streamlit webapp with multiple functionality

Omkar Pramod Hankare 2 Nov 24, 2021
Import modules and files straight from URLs.

Import Python code from modules straight from the internet.

Nate 2 Jan 15, 2022
Public Management System for ACP's 24H TT Fronteira 2021

CROWD MANAGEMENT SYSTEM 24H TT Vila de Froteira 2021 This python script creates a dashboard with realtime updates regarding the capacity of spectactor

VOST Portugal 1 Nov 24, 2021
ripgrep recursively searches directories for a regex pattern while respecting your gitignore

ripgrep (rg) ripgrep is a line-oriented search tool that recursively searches the current directory for a regex pattern. By default, ripgrep will resp

Andrew Gallant 35k Dec 31, 2022
KUIZ is a web application quiz where you can create/take a quiz for learning and sharing knowledge from various subjects, questions and answers.

KUIZ KUIZ is a web application quiz where you can create/take a quiz for learning and sharing knowledge from various subjects, questions and answers.

Thanatibordee Sihaboonthong 3 Sep 12, 2022
Vita Specific Patches and Application for Doki Doki Literature Club (Steam Version) using Ren'Py PSVita

Doki-Doki-Literature-Club-Vita Vita Specific Patches and Application for Doki Doki Literature Club (Steam Version) using Ren'Py PSVita Contains: Modif

Jaylon Gowie 25 Dec 30, 2022
chiarose(XCR) based on chia(XCH) source code fork, open source public chain

chia-rosechain 一个无耻的小活动 | A shameless little event 如果您喜欢这个项目,请点击star 将赠送您520朵玫瑰,可以去 facebook 留下您的(xcr)地址,和github用户名。 If you like this project, please

ddou123 376 Dec 14, 2022
A cheat sheet for streamlit

Streamlit Cheat Sheet App to summarise streamlit docs v1.0.0 There is also an accompanying png and pdf version https://github.com/daniellewisDL/stream

Daniel Lewis 221 Jan 04, 2023
Pokemon sword replay capture

pokemon-sword-replay-capture This is an old version (March 2020) pokemon-sword-replay-capture-mar-2020-version of my Pokemon Replay Capture software.

11 May 15, 2022
Algorand Python API examples

Algorand-Py Algorand Python API examples This repo will hold example scripts to monitor activities on Algorand main net. You can: Monitor your assets

Karthik Dutt 2 Jan 23, 2022
Youtube Channel Website

Videos-By-Sanjeevi Youtube Channel Website YouTube Channel Website Features: Free Hosting using GitHub Pages and open-source code base in GitHub. It c

Sanjeevi Subramani 5 Mar 26, 2022
A library for pattern matching on symbolic expressions in Python.

MatchPy is a library for pattern matching on symbolic expressions in Python. Work in progress Installation MatchPy is available via PyPI, and

High-Performance and Automatic Computing 151 Dec 24, 2022
Eros is an expiremental programming language built using simple Python code.

Eros is an expiremental programming language built using simple Python code. Featuring an easy syntax and unique features like type slicing, the language remains an expirement that grows in down time

zxro 2 Nov 21, 2021
🔤 Measure edit distance based on keyboard layout

clavier Measure edit distance based on keyboard layout. Table of contents Table of contents Introduction Installation User guide Keyboard layouts Dist

Max Halford 42 Dec 18, 2022
Recreating my first CRUD in python, but now more professional

Recreating my first CRUD in python, but now more professional

Ricardo Deo Sipione Augusto 2 Nov 27, 2021
A Non profit app built on top of Frappe framework & ERPNext

Non Profit A Non profit app built on top of Frappe framework & ERPNext. People who change the world need the tools to do it! The Non Profit Modules of

Frappe 16 Nov 17, 2022
A promo calculator for sports betting odds.

Sportbetter Calculation Toolkit Parlay Calculator This is a quick parlay calculator that considers some of the common promos offered. It is used to id

Luke Bhan 1 Sep 08, 2022