Open Crawl Vietnamese Text

Last update: Jan 05, 2022

This repo contains crawled Vietnamese text from multiple sources.

This is a list of high-quality, topic-centric public data sources. We collected and cleaned them from multiple sources. All of the datasets listed below are free.

Here are the ways we clean the data:

Removal of emojis
Removal of emoticons
Removal of URLs
Removal of HTML tags
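
The exact cleaning rules are not published in this repo, so the following is only a minimal sketch of how the four steps above could be applied with regular expressions. The patterns, the emoji ranges (not exhaustive), and the function name `clean_text` are all assumptions made for illustration.

```python
import re

# All patterns below are illustrative assumptions, not the repo's actual rules.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# A few common emoji blocks; this set is not exhaustive.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0000FE0F"             # variation selector-16
    "\U0000200D"             # zero width joiner
    "]+"
)
# Standalone ASCII emoticons such as :) ;-( :D =P
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][\-^o']?[)(\]\[DPpOo/\\|*3](?!\S)")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, emojis, and emoticons, then collapse whitespace."""
    # Tags go first so that a URL inside an attribute does not swallow nearby text.
    text = HTML_TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("<p>Xin chào 😀 :) xem https://example.com nhé</p>"))
```

The emoji ranges sit well outside the Latin and combining-mark blocks used by Vietnamese, so accented letters pass through unchanged.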

1. Binhvq News Corpus:

The Binhvq News Corpus was crawled from online news sites and contains 50 GB of text.

link_raw, link_clean

2. OSCAR corpus Vietnamese crawl:

OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The Vietnamese portion contains about 32 GB of text after duplicates are discarded.

link_raw, link_clean
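
OSCAR's own deduplication pipeline is not described here. As a rough illustration of what discarding duplicates can mean for a line-oriented text corpus, the sketch below keeps only the first occurrence of each line. The function name `dedup_lines` and the choice of exact-match hashing are assumptions.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line the first time it appears, comparing stripped content."""
    seen = set()
    for line in lines:
        # Store a fixed-size digest instead of the full line to limit memory use.
        key = hashlib.sha1(line.strip().encode("utf-8")).digest()
        if key in seen:
            continue
        seen.add(key)
        yield line


if __name__ == "__main__":
    sample = ["Hà Nội\n", "Huế\n", "Hà Nội\n"]
    print(list(dedup_lines(sample)))
```

Because the function is a generator, it can be fed an open file object and will stream through the corpus line by line. The set of digests still grows with the number of distinct lines.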

3. Vietnamese story dataset:

Contains 10 GB of short and long stories crawled from the internet by QAI.

link_clean

4. Vietnamese poem dataset:

Contains more than 1 million sentences collected from the internet by QAI.

link_clean

Owner

QAI Research
