AWS Glue PySpark - Apache Hudi Quick Start Guide

Overview

AWS Glue PySpark - Apache Hudi Quick Start Guide

Disclaimer:

This is a quick start guide for the Apache Hudi Python Spark connector, running on AWS Glue.

It's also specifically configured for the following Glue version:

  • AWS Glue 3.0
    • Spark 3.1.1
    • Python 3.7

Glue Configuration Reference: https://docs.aws.amazon.com/glue/latest/dg/add-job.html

Apache Hudi Reference: https://hudi.apache.org/docs/quick-start-guide/ for more information

Prerequisites:

- Python 3.6 or higher
- AWS CLI - Profile named 'dev' with Administrator Access (https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html)

Folder Structure:

glue-hudi-hello
├── README.md
├── cloud-formation
│   ├── command.md
│   └── GlueJobPySparkHudi.yaml
├── jars
│   ├── command.md
│   ├── hudi-spark3-bundle_2.12-0.9.0.jar
│   └── spark-avro_2.12-3.0.1.jar
├── job
│   ├── command.md
│   └── job.py
│   └── upload_job.py
├── requirements.txt

Step 1: Create and activate a virtualenv:

Create a new virtual environment for the project in its root directory:

python3 -m venv venv

Activate it:

source venv/bin/activate

Run from the root directory the pip install to get boto3.

pip install -r requirements.txt

Step 2: Create the AWS Resources:

Now, with a aws configured profile named as dev, cd into the cloud-formation folder and run the command in command.md.

As a AWS Cloud Formation exercise, read the command Parameters and how they are used on the GlueJobPySparkHudi.yaml file to dynamically create the Glue Job and S3 Bucket.

Step 3: Upload the Job and Jars to S3:

cd into the job folder and run the command in command.md.

cd into the jars folder and run the commands in command.md. Note: There is one command for each jar.

Step 4: Check AWS Resources results:

Log into aws console and check the Glue Job and S3 Bucket.

On the AWS Glue console, you can run the Glue Job by clicking on the job name.

After the job is finished, you can check the Glue Data Catalog and query the new database from AWS Athena.

On AWS Athena check for the database: hudi_demo and for the table: hudi_trips.

Owner
Gabriel Amazonas Mesquita
Gabriel Amazonas Mesquita
🎄 JustaGrabber - A discord token grabber written in python3

🎄 JustaGrabber - A discord token grabber written in python3 🎇 Made by kldiscord https://github.com/kldiscord 🌟 Please leave a star if you liked Jus

1 Dec 19, 2022
A discord self bot that replies to messages using cleverbot

cleverbot-discord-self A discord self bot that replies to messages using cleverbot Bot will respond to DMs and channels in the channels list. Need to

0 Jan 11, 2022
Connects to a local SenseCap M1 Helium Hotspot and pulls API Data.

sensecap_api_checker_HELIUM Connects to a local SenseCap M1 Helium Hotspot and pulls API Data.

Lorentz Factr 1 Nov 03, 2021
Passive income method via SerpClix. Uses a bot to accept clicks.

SerpClixBotSearcher This bot allows you to get passive income from SerpClix. Each click is usually $0.10 (sometimes $0.05 if offer isnt from the US).

Jason Mei 3 Sep 01, 2021
tgEasy's Official Assistant Bot and Example Bot

tgEasy Assistant The assistant bot that helps people with tgEasy directly on Telegram. This repository contains the source code of @tgEasyRobot and th

Divide Projects™ 4 Dec 26, 2022
Scrape the Twitter Frontend API without authentication.

Twitter Scraper 🇰🇷 Read Korean Version Twitter's API is annoying to work with, and has lots of limitations — luckily their frontend (JavaScript) has

Buğra İşgüzar 3.4k Jan 08, 2023
Currency And Gold Prices - Currency And Gold Prices For Python

Currency_And_Gold_Prices Photos from the app New Update Show range Change better

Ali HemmatNia 4 Sep 19, 2022
Randomly selects two teams based on who is in a voice channel on Discord

TeamPickerDiscordBot Randomly selects two teams based on who is in a voice channel on Discord What I Learned The ins and outs of Python as this was my

Brecken Enneking 2 Jan 27, 2022
The official Python library for Shodan

shodan: The official Python library and CLI for Shodan Shodan is a search engine for Internet-connected devices. Google lets you search for websites,

John Matherly 2.1k Dec 31, 2022
Get informed when your DeFI Earn CRO Validator is jailed or changes the commission rate.

CRO-DeFi-Warner Intro CRO-DeFi-Warner can be used to notify you when a validator changes the commission rate or gets jailed. It can also notify you wh

5 May 16, 2022
Twitter FakeNFT With Python

This project is a server that fetches your Twitter profile picture and applies the hexagonal transparency mask displayed on the profiles of users who have an NFT profile picture.

Mathis HAMMEL 29 Apr 23, 2022
Music bot because Octave is down and I can : )

Chords On a mission to build the best Discord Music Bot View Demo · Report Bug · Request Feature Table of Contents About The Project Built With Gettin

Aman Prakash Jha 53 Jan 07, 2023
Elkeid HUB - A rule/event processing engine maintained by the Elkeid Team that supports streaming/offline data processing

Elkeid HUB - A rule/event processing engine maintained by the Elkeid Team that supports streaming/offline data processing

Bytedance Inc. 61 Dec 29, 2022
Simple Python script that lets you upload image/video to imgur

Pymgur 🐍 Simple Python script that lets you upload image/video to imgur! Usage 🔨 Git Clone this repository install the requirements (pip install -r

3 Feb 20, 2022
Telegram Bot For Screenshot Generation.

Screenshotit_bot Telegram Bot For Screenshot Generation. Description An attempt to implement the screenshot generation of telegram files without downl

1 Nov 06, 2021
Python implementation of Spotify's authorization flow.

Spotify API Apps 🎷 🎶 🎼 This repository consists of many strange codes that make you think why the hell this guy doing this. Well... I got some reas

5 Dec 17, 2021
Code release for Transferable Curriculum for Weakly-Supervised Domain Adaptation (AAAI2019)

TCL Code release for Transferable Curriculum for Weakly-Supervised Domain Adaptation (AAAI2019) Dataset Office-31 dataset, with 0.4 label noise Requir

THUML @ Tsinghua University 17 Jul 07, 2022
Super simple anti-spam Discord bot

AutoAntiRaidBot Super simple anti-spam Discord bot. Will automatically kick any member with an account made under 1 day ago, and will ban any member w

Kainoa Kanter 6 Jun 27, 2022
Exchange indicators & Basic functions for Binance API.

binance-ema Exchange indicators & Basic functions for Binance API. This python library has been written to calculate SMA, EMA, MACD etc. functions wit

Emre MENTEŞE 24 Jan 06, 2023
Telegram Voice Chat UserBot made with Pyrogram and MarshalX/tgcalls with playlist and Heroku support

Telegram Voice Chat UserBot A Telegram UserBot to Play Audio in Voice Chats. This is also the source code of the userbot which is being used for playi

Calls Music 164 Nov 12, 2022