songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system

Last update: Jul 13, 2021

Related tags

Overview

Songplays User activity datamart

The following document describes the model used to build the songplays datamart table and the respective ETL process.

About
Getting Started
Data Model and Schema
Deployment
Built Using
Authors

About

The songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system.

This document describes the model of songplays table datamart on sparkify_app schema inside a container sparkify_postgres, and the Python code to load new data. The production directory and data must be simmilar to those in mnt/data/log_data and mnt/data/song_data paths in this repository.

🏁 Getting Started

First you need to have the right permissions to access the source files and write them into sparkify_app tables that generates the songplays datamart table. Contact the owners or your team leader for more information.

Data Model and Schema

Source files and owners

File or table	Description	Directory	Owner
YYYY-MM-DD-events.json	User events.	mnt/data/log_data/YYYY/11	Person 1
.json	Song data.	mnt/data/song_data/a	Person 2
`songplays`	Datamart for recomendation system.	`sparkify_app.songplays`	Person 3
`artists`	Dimension table for artists.	`sparkify_app.artists`	Person 1
`songs`	Dimension table for songs.	`sparkify_app.songs`	Person 1
`time`	Dimension table for streaming start time for a given song.	`sparkify_app.time`	Person 2
`users`	Dimension table for users.	`sparkify_app.users`	Person 3

Prerequisites

To run this project first you need to install the Docker Engine for your operational system and Docker Compose.

After installing and configuring the Docker tools, download this repository and create a folder named postgres that will store all sparkify_postgres service data. To build the proper images and run the services, just execute the following command inside this repository:

docker-compose up

If the service runs successfully you should see something like this:

...
sparkify_python      | 28/30 files processed.
sparkify_python      | 29/30 files processed.
sparkify_python      | 30/30 files processed.
sparkify_python exited with code 0

You can also check the job by following these steps:

Open your browser and access localhost:16543:
- Enter with the following credentials to authenticate:
  - e-mail: [email protected]
  - password: sp4rk1fy
After you log in, click on the Servers option at the upper corner on the left:
- You will be asked to enter with the PostgreSQL credentials:
  - User: sparkifypsql
  - Password: p4ssw0rd
Select the Query Tools under the Tools menu:

Under the Query Editor, run the following query:

SELECT * FROM sparkify_app.songplays WHERE song_id is NOT NULL and artist_id is NOT NULL;

You should get only 5 rows.

Microservice architecture

The following image represents the microservice architecture for this project:

Where:

sparkify_python: runs all Python scripts and stores raw data.
sparkify_postgres: runs Postgre and stores the database.
sparkify_pgadmin: runs the pgAdmin tool to monitor the sparkify_postgres service.

⛏️ Built Using

Dbeaver - Database tool.
Docker Compose - Tool to run multi-container applications.
Docker Engine - Container engine.
pandas - Data analysis and data wrangling tool.
pgAdmin - PostgreSQL tool.
psycopg2 - Database adapter for Python.
PostgreSQL - Reletional database management system.

✍️ Authors

@lkellermann - Idea & Initial work

songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system

Related tags

Overview

Songplays User activity datamart

Table of Contents

About

🏁 Getting Started

Data Model and Schema

Prerequisites

Microservice architecture

⛏️ Built Using

✍️ Authors

Owner

Leandro Kellermann de Oliveira

PyIOmica (pyiomica) is a Python package for omics analyses.

Pip install minimal-pandas-api-for-polars

First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we want to understand column level lineage and automate impact analysis.

💬 Python scripts to parse Messenger, Hangouts, WhatsApp and Telegram chat logs into DataFrames.

Data pipelines built with polars

Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions.

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data

Produces a summary CSV report of an Amber Electric customer's energy consumption and cost data.

Streamz helps you build pipelines to manage continuous streams of data

Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production

Instant search for and access to many datasets in Pyspark.

BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

TextDescriptives - A Python library for calculating a large variety of statistics from text

A simplified prototype for an as-built tracking database with API

Data Scientist in Simple Stock Analysis of PT Bukalapak.com Tbk for Long Term Investment

A distributed block-based data storage and compute engine

A 2-dimensional physics engine written in Cairo

A Python package for Bayesian forecasting with object-oriented design and probabilistic models under the hood.

Evaluation of a Monocular Eye Tracking Set-Up