Scraping malaysianpaygap & Extracting data from the Instagram posts

Overview

Scraping malaysianpaygap & Extracting data from the posts

Recently, @malaysianpaygap has become quite well known as a platform that lets workers throughout Malaysia anonymously share their salaries with other Malaysians. It's a great initiative, and I fully support ensuring that Malaysians are not taken advantage of by companies and get a liveable wage (especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

  1. Run the following to set up the conda environment:
  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt
  2. Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding, I am too lazy, so I will be using Instaloader to do all the heavy lifting for me. The conda environment will already have it installed for you.
# you might need to pass in your username to log in
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.
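
Before running the pipeline, it can help to confirm that the download produced the layout shown above. The following is a minimal sketch, not part of the repository: the helper names `group_post_files` and `posts_missing_image` are hypothetical, and the suffix list is taken only from the directory listing above, so posts with several images (which may be saved under other names) are not handled.

```python
from collections import defaultdict
from pathlib import Path

# Suffixes seen in the directory listing above (an assumption: other
# file kinds may exist). Each file is matched against the first suffix
# that fits its name.
KNOWN_SUFFIXES = ("_comments.json", ".json.xz", ".jpg", ".txt")


def group_post_files(profile_dir):
    """Map each post shortcode to the set of file suffixes found for it.

    Only looks one level down, i.e. at <profile_dir>/<year>/<file>.
    """
    posts = defaultdict(set)
    for path in sorted(Path(profile_dir).glob("*/*")):
        if not path.is_file():
            continue
        for suffix in KNOWN_SUFFIXES:
            if path.name.endswith(suffix):
                shortcode = path.name[: -len(suffix)]
                posts[shortcode].add(suffix)
                break
    return dict(posts)


def posts_missing_image(posts):
    """Return the shortcodes that have some files but no .jpg image."""
    return sorted(code for code, kinds in posts.items() if ".jpg" not in kinds)
```

Calling `posts_missing_image(group_post_files("malaysianpaygap"))` from the repository root should return an empty list if every downloaded post has an image.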

  3. You should now have everything ready to run the preprocessing scripts that I have made! A bash script runs everything in the correct order.
# make the bash script executable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB
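
The summary above shows that `id`, `created_at` and `likes_count` are stored as floats and that two rows have no comment data. Here is a minimal sketch of loading the file with the standard library and cleaning those columns. It assumes `created_at` is a Unix timestamp in seconds (not confirmed by this README), and `load_comments` is a hypothetical helper, not part of the repository.

```python
import csv
from datetime import datetime, timezone


def load_comments(csv_path):
    """Read comments.csv, skipping rows that carry no comment text."""
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Rows without a comment have empty cells in the CSV.
            if not row.get("text"):
                continue
            # Assumption: created_at holds Unix seconds.
            row["created_at"] = datetime.fromtimestamp(
                float(row["created_at"]), tz=timezone.utc
            )
            # The counts were saved as floats such as "3.0".
            row["likes_count"] = int(float(row["likes_count"]))
            rows.append(row)
    return rows
```

The same cleanup can be done in pandas; the `csv` module is used here only to keep the sketch free of extra dependencies.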

posts_text.csv - Contains all the posts along with the text extracted from their images using OCR (Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB
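
One possible next step is pulling salary figures out of the `image_text` column. The sketch below is only an illustration under a loud assumption: that amounts appear in forms such as "RM 3,500" or "RM4.5k". The README does not describe the text format, OCR output is noisy, and `extract_ringgit_amounts` is a hypothetical helper, so treat the pattern as a starting point.

```python
import re

# Assumed format: "RM", optional spaces, digits with optional thousands
# separators and decimals, optionally followed directly by "k".
AMOUNT_PATTERN = re.compile(r"\bRM\s*(\d[\d,]*(?:\.\d+)?)([kK]\b)?")


def extract_ringgit_amounts(text):
    """Return every ringgit amount found in text as a float."""
    amounts = []
    for number, thousands in AMOUNT_PATTERN.findall(text):
        value = float(number.replace(",", ""))
        if thousands:
            value *= 1000
        amounts.append(value)
    return amounts
```

For example, `extract_ringgit_amounts("Starting: RM 3,500. Now RM4.5k after 2 years")` returns `[3500.0, 4500.0]`.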

FAQ

I am getting a ModuleNotFoundError: No module named 'src' error. What can I do?

This is an issue with your PYTHONPATH. Adding the repository to it with something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" (replace the path with the location of your own copy of the repository) should fix it.
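
If you would rather not change your shell environment, the same effect can be had from inside Python before importing `src`. This is a minimal sketch of that alternative, not something the repository does itself; `add_repo_root_to_path` is a hypothetical helper and the path you pass is your own checkout location.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(repo_root):
    """Put the repository root on sys.path so `import src` resolves."""
    root = str(Path(repo_root).resolve())
    # Avoid adding the same entry twice on repeated calls.
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

Call it once at the top of a notebook or script, for example `add_repo_root_to_path(".")` when running from the repository root.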

Optimizations

  1. Currently the entire project isn't reproducible, so I will dockerise it soon to allow anyone to run it locally without any issues.
  2. You may notice there is a slow apply() used for binarizing the images and extracting the text from them using OCR. I am already using swifter to speed it up.
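
For readers unfamiliar with the term, binarizing means reducing a grayscale image to pure black and white before OCR. The sketch below shows the idea with a fixed threshold on plain nested lists. It is an illustration only: the README does not say which thresholding method or image library the repository uses, and both the `binarize` name and the default threshold of 128 are assumptions.

```python
def binarize(gray_rows, threshold=128):
    """Map each 0-255 grayscale value to 0 (black) or 255 (white).

    gray_rows is a list of rows, each a list of pixel intensities.
    Values at or above the threshold become white.
    """
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in gray_rows
    ]
```

For example, `binarize([[0, 127, 128, 255]])` returns `[[0, 0, 255, 255]]`.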
Owner
Yudhiesh Ravindranath
Data Scientist @MoneyLion