NLP_0-project

Group project for MFIN7036. Our goal is to predict firm profitability with text-based competition measures¹. We are a "democratic" and collaborative group of five; our names are listed below according to our initial division of work 😄.

Here is the outline of our project:

Data collection.

@LeiyuanHuo, jyang130, FanFanShark, xdc1999, gaojiamin1116

Based on the file data-WRDS-list.csv, write a web scraper that downloads every 10-K (HTML format) these companies filed with the SEC between 2010 and 2022 from the historical EDGAR documents, and rename each file data-10K-COMPNAME-Year.html.
Parse the HTML files to extract the Business and MD&A sections.
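The naming and parsing steps above could be sketched as follows. This is a minimal illustration using only the standard library, not the group's actual scraper: the helper names (`html_to_text`, `output_name`, `extract_section`) and the heading patterns in `SECTION_PATTERNS` are assumptions, and real 10-K filings vary enough in layout that the patterns would need tuning. Downloading from EDGAR is left out.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text from an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def output_name(company, year):
    """Build the file name data-10K-COMPNAME-Year.html from the outline."""
    safe = re.sub(r"[^A-Za-z0-9]+", "", company)
    return f"data-10K-{safe}-{year}.html"


# Hypothetical patterns: (heading that starts a section, heading that ends it).
SECTION_PATTERNS = {
    "business": (r"item\s*1\.?\s*business", r"item\s*1a\.?|item\s*2\.?"),
    "mdna": (r"item\s*7\.?\s*management", r"item\s*7a\.?|item\s*8\.?"),
}


def extract_section(text, name):
    """Return the text from a section heading up to the next item heading."""
    start_pat, end_pat = SECTION_PATTERNS[name]
    starts = list(re.finditer(start_pat, text, flags=re.I))
    if not starts:
        return ""
    # Assumption: the table of contents mentions the heading first,
    # so the last match is taken as the start of the real section.
    start = starts[-1]
    end = re.search(end_pat, text[start.end():], flags=re.I)
    stop = start.end() + end.start() if end else len(text)
    return text[start.start():stop].strip()
```

For example, `output_name("Apple Inc.", 2015)` gives `data-10K-AppleInc-2015.html`, and `extract_section(html_to_text(raw_html), "mdna")` returns the MD&A text when the heading is found, or an empty string otherwise.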

Text Processing: feature extraction²

Part-of-speech (POS) tagging (a main method) to extract product names and descriptions. Store these for each company.
Named entity recognition (NER) (also a main method) to extract the names of mentioned competitors. Store these for each company.
Product texts: BoW and tf-idf for each company's product(s), which should give us a term-product matrix.
Competitor texts: BoW only, since what we care about is how often a firm is mentioned.
‼️ We also need to combine sector and firm size/market power with the competitor texts and re-count.
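A minimal sketch of the counting steps above, assuming the POS and NER stages have already produced a product text per company and a list of entity strings. The function names, the simple letters-only tokenizer, and the particular tf-idf variant (term frequency times log of N over document frequency) are illustrative choices, not something the outline fixes; the sector and firm-size adjustment is not covered.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Raw term counts for one text."""
    return Counter(tokenize(text))


def tf_idf(docs):
    """docs maps company -> product text; returns company -> {term: weight}."""
    counts = {name: bag_of_words(text) for name, text in docs.items()}
    n_docs = len(counts)
    doc_freq = Counter()
    for bow in counts.values():
        doc_freq.update(bow.keys())
    weights = {}
    for name, bow in counts.items():
        total = sum(bow.values())
        weights[name] = {
            term: (freq / total) * math.log(n_docs / doc_freq[term])
            for term, freq in bow.items()
        }
    return weights


def competitor_mentions(entities, known_firms):
    """Count how often each known firm appears among NER output strings."""
    known = {firm.lower(): firm for firm in known_firms}
    return Counter(known[e.lower()] for e in entities if e.lower() in known)
```

With this variant, a term that appears in every company's product text gets weight zero, which is the intended effect: it does not help distinguish products.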

Text Processing: feature transformation and representation²

Term-product matrix: calculate pairwise cosine similarity scores between products, then use a score threshold to group similar products.
Term-product matrix: alternatively, apply a clustering method (e.g., KMeans) directly to the product vectors.
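The threshold approach could look roughly like this. It is a sketch under assumptions: product vectors are stored as sparse `{term: weight}` dicts, and "grouping" is read as linking every pair at or above the threshold and taking connected components (so two products can share a group through a chain of similar neighbours). The KMeans alternative is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_by_threshold(vectors, threshold):
    """Link products with similarity >= threshold; return connected groups."""
    names = list(vectors)
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())
```

The size of each firm's group (minus one) is one candidate for the "similar product count" measure used later; the threshold itself is a tuning choice the outline leaves open.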

Econometric Analysis and Hypothesis Testing²

Multivariate regression: the dependent variable (DV) is profitability (e.g., sales, revenue, Tobin's q); the independent variables (IVs) are the competition measures (one from the similar-product count, one from mentions as a competitor); relevant control variables are also included.
Cross-sectional portfolios: our competition measures are cross-sectional (one for each year), so we can form long-short portfolios on each measure and examine the effect on stock returns.
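The portfolio step for a single year might be sketched as below. The outline does not say which leg is long, so the direction here (long the least-exposed firms, short the most-exposed) is an assumption, as are the function name, the equal weighting, and the default 20% cutoff.

```python
from statistics import mean


def long_short_return(measure, next_return, fraction=0.2):
    """Equal-weighted return spread between low- and high-competition firms.

    measure: firm -> competition measure for one year.
    next_return: firm -> subsequent stock return.
    Assumption: long the lowest `fraction` of firms, short the highest.
    """
    firms = sorted((f for f in measure if f in next_return), key=lambda f: measure[f])
    k = max(1, int(len(firms) * fraction))
    if len(firms) < 2 * k:
        raise ValueError("not enough firms to form two separate legs")
    low, high = firms[:k], firms[-k:]
    return mean(next_return[f] for f in low) - mean(next_return[f] for f in high)
```

Repeating this for every year and for both competition measures yields a time series of spreads whose average can then be tested against zero.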

¹ Two papers inspired this project. Citations: Eisdorfer, A., Froot, K., Ozik, G., & Sadka, R. (2021). Competition Links and Stock Returns. The Review of Financial Studies, 2021-12-20. Hoberg, G., & Phillips, G. (2016). Text-Based Network Industries and Endogenous Product Differentiation. The Journal of Political Economy, 124(5), 1423-1465.
² The text-processing steps are based on the MFIN7036 Lecture_Notes and a review paper. Citation: Marty, T., Vanstone, B., & Hahn, T. (2020). News media analytics in finance: A survey. Accounting and Finance (Parkville), 60(2), 1385-1434.
