A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
Python based Spotify account generator.

Spotify Account Generator Python based Spotify account generator. Installation Download the latest release, open command prompt in the folder, run pip

polo 5 Dec 27, 2022
Represents a Lavalink client used to manage nodes and connections.

lavaplayer Represents a Lavalink client used to manage nodes and connections. setup pip install lavaplayer setup lavalink you need to java 11* LTS or

HazemMeqdad 37 Nov 21, 2022
Fully automated Chegg Discord bot for "homework help"

cheggbog Fully automated Chegg Discord bot for "homework help". Working Sept 15, 2021 Overview Recently, Chegg has made it extremely difficult to auto

Bryce Hackel 8 Dec 23, 2022
Stream Telegram files to web

Telegram File Stream Bot A Telegram bot to stream files to web Demo Bot » Report a Bug | Request Feature Table of Contents About this Bot Original Rep

Wrench 572 Jan 09, 2023
unofficial source of the discord bot, “haunting.” created by: vorqz, vert, & Veltz

hauntingSRC unofficial source of the discord bot, “haunting.” created by: vorqz, vert, & Veltz reasoning: creators skidded the most of this bot and do

Vast 11 Nov 04, 2022
Pycord, a maintained fork of discord.py, is a python wrapper for the Discord API

pycord A fork of discord.py. PyCord is a modern, easy to use, feature-rich, and async ready API wrapper for Discord written in Python. Key Features Mo

Pycord Development 2.3k Dec 31, 2022
M3U Playlist for free TV channels

Free TV This is an M3U playlist for free TV channels around the World. Either free locally (over the air): Or free on the Internet: Plex TV Pluto TV P

Free TV 964 Jan 08, 2023
A simple Telegram bot which handles images in whole different way

zeroimagebot thezeroimagebot 🌟 I Can Edit Dimension Of An image which is required by @stickers 🌟 I Can Extract Text From An Image 🌟 !!! New Updates

RAVEEN KUMAR 4 Jul 01, 2021
FAIR Enough Metrics is an API for various FAIR Metrics Tests, written in python

☑️ FAIR Enough metrics for research FAIR Enough Metrics is an API for various FAIR Metrics Tests, written in python, conforming to the specifications

Maastricht University IDS 3 Jul 06, 2022
A twitter multi-tool for OSINT on twitter accounts.

TwitterCheckr A twitter multi-tool for OSINT on twitter accounts. Infomation TwitterCheckr also known as TCheckr is multi-tool for OSINT on twitter a

IRIS 16 Dec 23, 2022
A Bot Upload file|video To Telegram using given Links.

A Bot Upload file|video To Telegram using given Links.

Hash Minner 19 Jan 02, 2023
Fast and small Discord-Toolset.

Mooncord 🌙 Discord server: https://discord.gg/frnpk2rg Fast and small Discord-Toolset. Enjoy? Star this repo ⭐ (Main file in Mooncord/Moon-1.0.1/vers

7ua 9 Dec 11, 2021
A Telegram bot that searches for the original source of anime, manga, and art

A Telegram bot that searches for the original source of anime, manga, and art How to use the bot Just send a screenshot of the anime, manga or art or

Kira Kormak 9 Dec 28, 2022
A google search telegram bot.

Google-Search-Bot A google search telegram bot. Made with Python3 (C) @FayasNoushad Copyright permission under MIT License License - https://github.c

Fayas Noushad 37 Nov 24, 2022
Nasdaq Cloud Data Service (NCDS) provides a modern and efficient method of delivery for realtime exchange data and other financial information. This repository provides an SDK for developing applications to access the NCDS.

Nasdaq Cloud Data Service (NCDS) Nasdaq Cloud Data Service (NCDS) provides a modern and efficient method of delivery for realtime exchange data and ot

Nasdaq 8 Dec 01, 2022
A Powerful, Smart And Simple Userbot In Pyrogram.

Eagle-USERBOT 🇮🇳 A Powerful, Smart And Simple Userbot In Pyrogram. Support 🚑 Inspiration & Credits Userge-X Userge Pokurt Pyrogram Code Owners Mast

Masterolic 1 Nov 28, 2021
IACR Events Scraper

IACR Events Scraper This scrapes https://iacr.org/events/ and exports it as a calendar file. I host a version of this for myself under https://arrrr.c

Karolin Varner 6 May 28, 2022
Uploader-Bot - A Modified Telegram Url Uploader Bot With Mongodb, Zee5, Sonyliv Support and Many Other Yt-dlp Sites

𝚁𝚎𝚚𝚞𝚒𝚛𝚎𝚍 𝚅𝚊𝚛𝚒𝚊𝚋𝚕𝚎𝚜 🔊 APP_ID API_HASH TG_BOT_TOKEN DATABASE_URL

11 Sep 10, 2022
Collaboration with Microsoft, AWS, Google, and ETHZürich Analytics Club (2022 Datathon Project)

DATATHON_ Collaboration with Microsoft, AWS, Google, and ETHZürich Analytics Club (2022 Datathon Project) Datathon Original Challenge SAV DataDays Rei

esthi 34 Nov 10, 2022
Python script to replace BTC adresses in the clipboard with similar looking ones, whose private key can be retrieved by a netcat listener or similar.

BTCStealer Python script to replace BTC adresses in the clipboard with similar looking ones, whose private key can be retrieved by a netcat listener o

Some Person 6 Jun 07, 2022