Catalogue data - Python scripts to prepare catalogue data

Last update: Mar 03, 2022

Related tags

Data Analysis catalogue_data

Overview

catalogue_data

Scripts to prepare catalogue data.

Setup

Clone this repo.

Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation

sudo apt-get install git-lfs
git lfs install

Install system dependencies:

sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar

Create a virtual environment, activate it and install the Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a User Access Token (with write access) at Hugging Face Hub: https://huggingface.co/settings/token and set the following environment variables in the .env file at the root directory:

HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
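
Before running the scripts, it can help to confirm that the .env file defines all four variables. The following is a minimal sketch using only the standard library; it is not part of this repo, and the helper names `parse_env_text` and `missing_vars` are hypothetical. It assumes simple KEY=VALUE lines, which may be stricter than what the repo's own .env loading accepts.

```python
from pathlib import Path

# Variable names taken from the .env template above.
REQUIRED_VARS = ("HF_USERNAME", "HF_USER_ACCESS_TOKEN", "GIT_USER", "GIT_EMAIL")


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    missing = missing_vars(values)
    print("Missing or empty:", ", ".join(missing) if missing else "none")
```

Run it from the root directory; any variable it reports should be filled in before continuing.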

Create metadata

To create dataset metadata (in file dataset_infos.json) run:

python create_metadata.py --repo <repo_id>

where you should replace <repo_id> with the ID of the dataset repository, e.g. bigscience-catalogue-lm-data/lm_ca_viquiquad

Aggregate datasets

To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:

python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>

where you should replace:

<path_to_file_with_dataset_ratios>: path to a JSON file containing a dict that maps dataset names (keys) to their ratios (values), each between 0 and 1.
<dir_path_to_save_aggregated_dataset>: path to the directory where the aggregated dataset is saved.
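
As an illustration of the ratios file, here is a minimal sketch that validates a dict of dataset ratios and writes it as JSON. The helper names `validate_ratios` and `write_ratios`, the second dataset name, the ratio values and the output file name dataset_ratios.json are all assumptions made for this example. Treating 0 and 1 as valid endpoints is also an assumption, since the README only says "between 0 and 1".

```python
import json


def validate_ratios(ratios):
    """Check that keys are non-empty strings and values are numbers in [0, 1]."""
    for name, ratio in ratios.items():
        if not isinstance(name, str) or not name:
            raise ValueError(f"Invalid dataset name: {name!r}")
        is_number = isinstance(ratio, (int, float)) and not isinstance(ratio, bool)
        if not is_number or not 0 <= ratio <= 1:
            raise ValueError(f"Ratio for {name!r} must be between 0 and 1, got {ratio!r}")
    return ratios


def write_ratios(ratios, path):
    """Validate the ratios and save them as a JSON dict."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(validate_ratios(ratios), f, indent=2)


if __name__ == "__main__":
    ratios = {
        "bigscience-catalogue-lm-data/lm_ca_viquiquad": 1.0,
        "my-org/hypothetical_dataset": 0.25,  # hypothetical name, for illustration only
    }
    write_ratios(ratios, "dataset_ratios.json")
```

The resulting file can then be passed as --dataset_ratios_path.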


Owner

BigScience Workshop
