NLP-Project - Used an API to scrape 2000 reddit posts, then used NLP analysis and created a classification model to mixed succcess

Overview

Project 3: Web APIs & NLP

Problem Statement

How do r/Libertarian and r/Neoliberal differ on Biden post-inaguration?

The goal of the project is to see how these two ideologically similar subreddits perceive Biden and his term as president so far.

Success in this project isn't to necessarily develop a model that accurately predicts consistently, but rather to convey what issues these two ideologies care about and the overall sentiment both subreddits have regarding Biden. Considering a lot of this information will be rather focused on EDA, it's hard to necessarily judge the success of this project on the individual models created, rather the success of this project will be determined primarily in the EDA, Visualization, and Presentation sections of the actual project. With that being said however, I will still use a wide variety of models to determine the predictive value of the data I gathered.

Hypothesis: I believe that the two subreddits will differ significantly on what issues they discuss and their sentiment towards Biden, I think because of these differences a model can be made that can accurately predict which post belongs to who. Primarily, I will be focusing on the differences between these subreddits in sentiment and words used.

Data Collection

When collecting data, I initially didn't have the problem statement in mind necessarily before I started. When I began data collecting, I knew I wanted to do something political specifically on the Biden admin post innaguration but I really wanted to go through the process experimenting with different subreddits which made for an interesting situation.

I definitely learned a lot more about the API going into the data collection process blind,such as knowing to avoid deleted posts by excluding "[deleted]" from the selftext among other things, especially about using score and created_utc for gathering posts. I would say the most difficult process was just finding subreddits and then subsequently seeing if they have enough posts while trying to construct different problem statements using the viable subreddits.

At the end, I decided on just choosing r/neoliberal and r/libertarian, there might've been easier options for model creation but personally, I found it a lot more interesting especially since I already browse r/neoliberal fairly frequently so I was invested in the analysis.

Data Cleaning and EDA

When performing data cleaning and EDA, I really did these two tasks in two seperate notebooks. My logistic regression notebook and in my notebook dedicated to EDA and data cleaning. The reason for that being, I initially just had the logistic regression notebook but then wanted to do further analysis on vectorized sets so I created it's own notebook for that while still at times referencing ideal vectorizer parameters I found in my logistic regression notebook.

Truth be told, I did some cleaning in the data gathering notebook, just checking if there were any duplicates or if there were any oddities that I found and I didn't find much, there might have been a few removed posts that snuck in to my analysis but truth be told, it wasn't anything warranting an editing of my data gathering techniques or anything that would stop me from using the data I already gathered.

EDA primarily was just trying to find words that stuck out using count vectorizers, luckily, that was fairly easy to do considering the NLP process came fairly naturally to me. I used lemmatizers for model creation but I rarely used it for my actual EDA, I primarily just used a basic tokenizer without any added features. The bulk of my presentation directly comes from this and domain knowledge where I can create conclusions from the information gathered from this EDA process. EDA helped present a narrative that I was able to fully formulate with my domain knowledge which then resulted in the conclusions found in my presentation.

Another part of EDA that was critical, was the usage of sentiment analysis to find the difference in overall tone between the two subreddits on Biden, this was especially important in my analysis as it also ended up being apart of my preprocessing as well. Sentiment analysis was used in my presentation to present the differences in tone towards Biden but also emphasize the amount of neturality in the posts themselves, this is due primarily to the posts being titles of politically neutral news titles or tweets.

Preprocessing and Modelling

Modelling was a very tenuous process and Preprocessing as well because a lot of it was very memory intensive which resulted in a lot of time spent baby-sitting my laptop but ultimately it provided a lot of valuable information not only on the data I was investigating but also on the models I was using. I used bagging classifiers, logistic regression models, decision trees, random forest models, and boosted models. All of these I had to very mixed success but logistic regression was the one I had the most consistency with, especially with self text exclusive posts. Random forest, decision trees, and boosted models, I all had high expectations for but was not as consistently effective as the logistic regression models. Due to general model underperformance, I will be primarily talking about the logistic regression models I created in the logreg notebook as I had dedicated the most time finetuning those models and had generally more consistent performance with those models than I did others.

I specifically had massive troubles with predicting neoliberal posts while Libertarian posts, I generally managed a decent rate at. My specificity was a lot better than my sensitivity. When I judged my model's ability to predict, I looked at self-text, title-exclusive, and total text. This allowed me to individually look at what each model was good at predicting and also what data to gather the next time I interact with this API.

My preprocessing was very meticulous, specifically experimenting with different vectorizer parameters when using my logistic regression model. Adjustment of parameters and the addition of sentiment scores to try and help the model's performance. Adjusting the vectorizer parameters such as binary and others were heavily tweaked depending on the X variable used (selftext, title, totaltext).

Conclusion

When analyzing this data, it is clear that there are three key takeaways from my modeling process and EDA stage.

  1. The overwhelming neutrality in the text (specifically the title) itself, can hide the true opinions of those in the subreddit.

  2. Predictive models are incredibly difficult to perform on these subreddits in particular and potentially other political subreddits.

  3. The issues in which the subreddits most differ on, is primarily due to r/Libertarian focusing more on surveillance and misinformation in the media while r/Neoliberal is concerned with global politics, climate, and sitting senate representatives.

  4. They both discuss tax, covid, stimulus, china and other current topics relatively often

Sources Used

Britannica Definition of Libertarianism

Neoliberal Project

Stanford Philosophy: Libertarianism

Stanford Philosophy: Neoliberalism

Neoliberal Podcast: Defining Neoliberalism

r/Libertarian

r/neoliberal

Owner
Adam Muhammad Klesc
Hopeful data scientist. Currently in General Assembly and taking their data science immersive course!
Adam Muhammad Klesc
Binaural Speech Synthesis

Binaural Speech Synthesis This repository contains code to train a mono-to-binaural neural sound renderer. If you use this code or the provided datase

Facebook Research 135 Dec 18, 2022
Predict the spans of toxic posts that were responsible for the toxic label of the posts

toxic-spans-detection An attempt at the SemEval 2021 Task 5: Toxic Spans Detection. The Toxic Spans Detection task of SemEval2021 required participant

Ilias Antonopoulos 3 Jul 24, 2022
This project consists of data analysis and data visualization (done using python)of all IPL seasons from 2008 to 2019 and answering the most asked questions about the IPL.

IPL-data-analysis This project consists of data analysis and data visualization of all IPL seasons from 2008 to 2019 and answering the most asked ques

Sivateja A T 2 Feb 08, 2022
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Kashgari Overview | Performance | Installation | Documentation | Contributing 🎉 🎉 🎉 We released the 2.0.0 version with TF2 Support. 🎉 🎉 🎉 If you

Eliyar Eziz 2.3k Dec 29, 2022
Official PyTorch Implementation of paper "NeLF: Neural Light-transport Field for Single Portrait View Synthesis and Relighting", EGSR 2021.

NeLF: Neural Light-transport Field for Single Portrait View Synthesis and Relighting Official PyTorch Implementation of paper "NeLF: Neural Light-tran

Ken Lin 38 Dec 26, 2022
A simple Flask site that allows users to create, update, and delete posts in a database, as well as perform basic NLP tasks on the posts.

A simple Flask site that allows users to create, update, and delete posts in a database, as well as perform basic NLP tasks on the posts.

Ian 1 Jan 15, 2022
Associated Repository for "Translation between Molecules and Natural Language"

MolT5: Translation between Molecules and Natural Language Associated repository for "Translation between Molecules and Natural Language". Table of Con

67 Dec 15, 2022
📝An easy-to-use package to restore punctuation of the text.

✏️ rpunct - Restore Punctuation This repo contains code for Punctuation restoration. This package is intended for direct use as a punctuation restorat

Daulet Nurmanbetov 72 Dec 30, 2022
History Aware Multimodal Transformer for Vision-and-Language Navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation This repository is the official implementation of History Aware Multimodal Tra

Shizhe Chen 46 Nov 23, 2022
2021海华AI挑战赛·中文阅读理解·技术组·第三名

文字是人类用以记录和表达的最基本工具,也是信息传播的重要媒介。透过文字与符号,我们可以追寻人类文明的起源,可以传播知识与经验,读懂文字是认识与了解的第一步。对于人工智能而言,它的核心问题之一就是认知,而认知的核心则是语义理解。

21 Dec 26, 2022
Code for hyperboloid embeddings for knowledge graph entities

Implementation for the papers: Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge Graphs, Nurendra Choudhary, Nikhil Rao,

30 Dec 10, 2022
Yet Another Sequence Encoder - Encode sequences to vector of vector in python !

Yase Yet Another Sequence Encoder - encode sequences to vector of vectors in python ! Why Yase ? Yase enable you to encode any sequence which can be r

Pierre PACI 12 Aug 19, 2021
Unsupervised Language Modeling at scale for robust sentiment classification

** DEPRECATED ** This repo has been deprecated. Please visit Megatron-LM for our up to date Large-scale unsupervised pretraining and finetuning code.

NVIDIA Corporation 1k Nov 17, 2022
Sequence Modeling with Structured State Spaces

Structured State Spaces for Sequence Modeling This repository provides implementations and experiments for the following papers. S4 Efficiently Modeli

HazyResearch 902 Jan 06, 2023
CCF BDCI 2020 房产行业聊天问答匹配赛道 A榜47/2985

CCF BDCI 2020 房产行业聊天问答匹配 A榜47/2985 赛题描述详见:https://www.datafountain.cn/competitions/474 文件说明 data: 存放训练数据和测试数据以及预处理代码 model_bert.py: 网络模型结构定义 adv_train

shuo 40 Sep 28, 2022
InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective This is the official code base for our ICLR 2021 paper

AI Secure 71 Nov 25, 2022
🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.

State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0 🤗 Transformers provides thousands of pretrained models to perform tasks o

Hugging Face 77.3k Jan 03, 2023
PyTorch Implementation of the paper Single Image Texture Translation for Data Augmentation

SITT The repo contains official PyTorch Implementation of the paper Single Image Texture Translation for Data Augmentation. Authors: Boyi Li Yin Cui T

Boyi Li 52 Jan 05, 2023
This is a NLP based project to extract effective date of the contract from their text files.

Date-Extraction-from-Contracts This is a NLP based project to extract effective date of the contract from their text files. Problem statement This is

Sambhav Garg 1 Jan 26, 2022
A python wrapper around the ZPar parser for English.

NOTE This project is no longer under active development since there are now really nice pure Python parsers such as Stanza and Spacy. The repository w

ETS 49 Sep 12, 2022