Library to scrape and clean web pages to create massive datasets.

Overview

lazynlp

A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this library, you should be able to create datasets larger than the one used by OpenAI for GPT-2.

Setup

This library uses Python 3.

  1. Clone this library and cd into the lazynlp folder:

git clone https://github.com/chiphuyen/lazynlp.git
cd lazynlp

  2. Install dependencies:

pip3 install -r requirements.txt

  3. Install the library:

pip3 install .

If you want to uninstall the library, use:

pip3 uninstall lazynlp

How to create a massive dataset using lazynlp:

Step 1. Obtain URLs of the webpages you want to crawl

There are several major dumps of URLs available that you can use.

Reddit URLs

This is the link to all submissions to Reddit, grouped by month. You can download the raw dumps and process them to extract the links. Keep in mind that each of these dumps is huge (100MB - 1GB).

@jcpeterson is kind enough to provide a list of deduplicated links with at least 3 karma that you can download here.

There are about 23M URLs from 2015-06 to 2018-10, of which around 40-60% are bad URLs (URLs that no longer exist or aren't scraper-friendly). This means that after you've downloaded and cleaned all the good URLs, you should have approximately 10M webpages, or 50GB of pure text.

Gutenberg

You can download the list of all URLs to US Gutenberg books here. There are 50K books, which convert to about 14GB of pure text.

You can also run lazynlp.get_us_gutenberg_links() to get the same list. For example, to get all the Gutenberg URLs and store them in the file us_gutenberg.urls, run the following command. This might take half a day.

lazynlp.get_us_gutenberg_links('us_gutenberg.urls')

You can download the list of all URLs to Australian Gutenberg books here. There are 4k books, which convert to about 1GB of pure text.

You can also run lazynlp.get_aus_gutenberg_links() to get the same list. For example, to get all the Gutenberg URLs and store them in the file aus_gutenberg.urls:

lazynlp.get_aus_gutenberg_links('aus_gutenberg.urls')

Wikipedia

You can download the Wikipedia dumps here.

Step 2. Deduplicate URLs

You don't want to download the same URL multiple times. There are two functions that help you deduplicate all URLs:

lazynlp.dedup_lines(files, outfold)

This function takes in a list of files (in each file, each line is a URL) and deduplicates each file against all previous files. It saves all the deduplicated files in outfold.

lazynlp.dedup_lines_from_new_file(original_files, new_file, outfile)

This function allows you to deduplicate a new file against all previously deduplicated files (original_files).
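The idea behind this step can be sketched with the standard library alone. This is a minimal illustration, not lazynlp's implementation: the function name dedup_url_files is hypothetical, and it assumes the URL lists fit in memory as a Python set.

```python
import tempfile
from pathlib import Path


def dedup_url_files(files, outfold):
    """Write a deduplicated copy of each file into outfold.

    A URL is kept only the first time it is seen, across all files,
    so each file is deduplicated against every file before it.
    Returns the number of URLs kept per file.
    """
    out_dir = Path(outfold)
    out_dir.mkdir(parents=True, exist_ok=True)
    seen = set()
    kept_counts = []
    for name in files:
        kept = []
        for line in Path(name).read_text(encoding="utf-8").splitlines():
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                kept.append(url)
        out_path = out_dir / Path(name).name
        out_path.write_text("".join(u + "\n" for u in kept), encoding="utf-8")
        kept_counts.append(len(kept))
    return kept_counts


# Small demo with made-up URLs in temporary files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "reddit.urls")
    second = Path(tmp, "gutenberg.urls")
    first.write_text("http://a.example/1\nhttp://a.example/2\nhttp://a.example/1\n")
    second.write_text("http://a.example/2\nhttp://b.example/3\n")
    counts = dedup_url_files([first, second], Path(tmp, "dedup"))
    second_out = Path(tmp, "dedup", "gutenberg.urls").read_text()
```

In the demo, the second file loses the URL it shares with the first file, which is the "against all previous files" behaviour described above.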

Step 3. Download the URLs

If you want to download each webpage separately, call:

lazynlp.download_page(link, context=None, timeout=None)

If you want to download from a file that contains a list of URLs, call:

lazynlp.download_pages(link_file, folder, timeout=30, default_skip=True, extensions=[], domains=[])

"""

link_file:

	file contains links to webpages to crawl. Each line contains one URL.

folder:

	folder that you want to contain your downloaded pages.

timeout:

	seconds to wait for a page to respond before abandoning it.

default_skip:

	set to True if you want to automatically skip all URLs that contain domains and extensions that are known to be scraper-unfriendly or NSFW.

	You can see the list of excluded domains at lazynlp/exclude_domains.txt.

	You can see the list of excluded extensions at lazynlp/exclude_extensions.txt

You can also add your own domains and extensions to skip with the domains and extensions arguments.

In the folder:

	Each URL is downloaded into a file, indexed by the order in which it is downloaded. The first line of each file is the URL. The rest is the textual content of the page.
 	
 	index.urls contains all the URLs that have been successfully downloaded.
	
	bad.urls contains the URLs that are bad.
	
	connection.urls contains the URLs that haven't been downloaded because of connection issues.
	
	non_ascii.urls contains the URLs that haven't been downloaded because of bad encoding issues.
	
	empty.urls contains the URLs that have empty textual content.

"""

If you have a lot of URLs, you can divide the list into multiple files and call this function on each file separately. I was able to run 40 scripts in parallel. I guess I could have parallelized the code. I just found this to be easier.
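Splitting the list can be as simple as the sketch below. The helper split_url_file and the part file naming are assumptions made for illustration; they are not part of lazynlp.

```python
import tempfile
from pathlib import Path


def split_url_file(link_file, n_parts, out_prefix):
    """Split one URL file into n_parts files, distributing URLs round-robin."""
    lines = Path(link_file).read_text(encoding="utf-8").splitlines()
    urls = [u.strip() for u in lines if u.strip()]
    paths = []
    for i in range(n_parts):
        part = urls[i::n_parts]  # every n_parts-th URL, starting at offset i
        path = Path(f"{out_prefix}_{i:03d}.urls")
        path.write_text("".join(u + "\n" for u in part), encoding="utf-8")
        paths.append(path)
    return paths


# Demo: 10 made-up URLs split into 4 part files.
with tempfile.TemporaryDirectory() as tmp:
    link_file = Path(tmp, "all.urls")
    link_file.write_text("".join(f"http://example.com/{i}\n" for i in range(10)))
    parts = split_url_file(link_file, 4, Path(tmp, "part"))
    part_sizes = [len(p.read_text().splitlines()) for p in parts]
```

Each part file can then be passed to lazynlp.download_pages() from its own script, with a separate output folder per script.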

Step 4. Clean the webpages

You can get rid of all HTML tags, decode utf-8 into string, transliterate foreign characters, collapse white space, replace unprintable characters, unescape HTML, etc. using methods available in lazynlp/cleaner.py.

You can also just call the following function to do most of the processing.

lazynlp.clean_page(page)
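To make the listed steps concrete, here is a rough standard-library sketch of the same kind of pipeline. It is not the code in lazynlp/cleaner.py: the function rough_clean is hypothetical, the tag stripping is a naive regex that leaves script and style contents in place, and it collapses all whitespace, including paragraph breaks.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def rough_clean(raw_bytes):
    """Turn raw HTML bytes into plain ASCII text (simplified sketch)."""
    # Decode utf-8 into a string, replacing undecodable bytes.
    text = raw_bytes.decode("utf-8", errors="replace")
    # Strip tags first, so escaped markup such as &lt;b&gt; is not removed.
    text = TAG_RE.sub(" ", text)
    # Unescape HTML entities such as &amp;.
    text = html.unescape(text)
    # Transliterate accented characters where a decomposition exists;
    # characters with no ASCII equivalent are dropped.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace unprintable characters with spaces.
    text = "".join(ch if ch.isprintable() or ch.isspace() else " " for ch in text)
    # Collapse runs of whitespace.
    return WHITESPACE_RE.sub(" ", text).strip()


cleaned = rough_clean(b"<p>Caf\xc3\xa9 &amp;  tea</p>")
```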

Note:

In this library, the function lazynlp.download_pages() does both the crawling and cleaning part, so the webpages you have are pure text, like this:

http://www.thecannabist.co/2017/03/02/jeff-sessions-russia-resign-democrats/74687/
Attorney general nominee Sen. Jeff Sessions, R-Ala., testifies on Capitol Hill in Washington on Jan. 10, 2017, in the first day of his confirmation hearing before the Senate Judiciary Committee. Top Democrats now say that because he misled the committee about his visits to Russia, he should resign. (Andrew Harnik, The Associated Press)

House Oversight and Government Reform Committee Chairman Jason Chaffetz, R-Utah, tweeted early Thursday that "AG Sessions should clarify his testimony and recuse himself."

Later, Sen. Rob Portman, R-Ohio, said in a statement, "Jeff Sessions is a former colleague and a friend, but I think it would be best for him and for the country to recuse himself from the DOJ Russia probe."

House Majority Leader Kevin McCarthy, R-Calif., also initially said during an appearance on MSNBC's "Morning Joe" that Sessions should bow out.

Asked whether Sessions should recuse himself in this situation, McCarthy replied "I think the trust of the American people -- you recuse yourself in these situations, yes."

McCarthy was pressed a second time about whether he was calling for Sessions to recuse himself and he confirmed that he believed the situation required a recusal.

"I think it would be easier from that standpoint, yes," McCarthy said.

But McCarthy later said his comment had been misinterpreted, telling Fox News' "Fox and Friends," "I'm not calling on him to recuse himself. I was asked on 'Morning Joe,' if he needs to recuse himself as going forward. As you just heard, Attorney General Sessions said he would recuse himself going forward -- appropriate, and that's all my answer was."

The comments from prominent Republicans follow revelations that Sessions met with the Russian ambassador during election season. Under oath in front of the Senate Judiciary Committee for his confirmation hearing in January, Sessions had said that he had not met with any Russian officials.

Senate Minority Leader Charles Schumer, D-N.Y., joined growing Democratic calls for Sessions to either resign or at least recuse himself from any investigations into Russia's meddling in U.S. elections.

"Attorney General Sessions cannot possibly lead an investigation into Russian interference in our elections or come anywhere near it. With these revelations, he may indeed become the subject of it," Schumer told reporters. "Better for the country if he resigns, but let's get an investigation going."

Because the Department of Justice should be above reproach, for the good of the country, the Attorney General should resign.
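Since the first line of each downloaded file is the URL and the rest is the text, separating the two takes one call. The helper name split_downloaded_page and the sample content are made up for illustration.

```python
def split_downloaded_page(content):
    """Split the content of a downloaded file into (url, text)."""
    url, _, body = content.partition("\n")
    return url.strip(), body.strip()


sample = (
    "http://www.example.com/some-article/\n"
    "First paragraph of the cleaned page.\n"
    "\n"
    "Second paragraph.\n"
)
url, text = split_downloaded_page(sample)
```

This is also why the overlap functions in Step 5 take a header argument: it lets them skip the URL line.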

Step 5. Remove duplicated webpages

To avoid any piece of text being over-represented, you want to only include pages that don't significantly overlap with other pages.

To estimate the amount of overlapping of target files with certain source files, use this function:

lazynlp.estimate_overlap(source_files, target_files, gran='word', n=8, capacity=10000, error_rate=1e-5, header=0, interval=100000)

gran is the granularity of tokens: 'char' or 'word' level.

n is the size of the n-grams.

capacity and error_rate are for the BloomFilter used.

header is the number of lines to skip at the start of each file. It exists because in our format, the first line is the URL.
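The parameters above can be illustrated with a small n-gram overlap sketch. It swaps the Bloom filter for an exact Python set, which is simpler but uses far more memory on large corpora, and the function names are hypothetical rather than lazynlp's own.

```python
def ngrams(text, n=8, gran="word"):
    """Return the list of n-grams of text at word or char granularity."""
    tokens = text.split() if gran == "word" else list(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_fraction(source_texts, target_text, n=8, gran="word"):
    """Fraction of the target's n-grams that also occur in the sources."""
    seen = set()  # exact stand-in for the Bloom filter
    for text in source_texts:
        seen.update(ngrams(text, n, gran))
    target = ngrams(target_text, n, gran)
    if not target:
        return 0.0
    return sum(gram in seen for gram in target) / len(target)


# Two of the four target bigrams, (a, b) and (b, c), occur in the source.
fraction = overlap_fraction(["a b c d e"], "a b c x y", n=2)
```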

To estimate the amount of overlapping of a target file with an existing BloomFilter, use this function:

lazynlp.estimate_overlap_bf(bf, target_file, gran='word', n=8, header=0)
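For a sense of what capacity and error_rate cost in memory, the textbook sizing formulas for a standard Bloom filter can be evaluated directly. This is only an approximation under that assumption: the BloomFilter implementation the library uses may size itself differently.

```python
import math


def bloom_size(capacity, error_rate):
    """Textbook Bloom filter sizing.

    bits   = -capacity * ln(error_rate) / (ln 2)^2
    hashes = (bits / capacity) * ln 2
    """
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / capacity * math.log(2)))
    return bits, hashes


# Defaults of estimate_overlap: capacity=10000, error_rate=1e-5.
bits, hashes = bloom_size(10_000, 1e-5)
kilobytes = bits / 8 / 1000
```

With these defaults the filter needs roughly 24 bits per stored n-gram, so raising capacity or lowering error_rate increases memory use accordingly.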

Given a list of files (e.g. cleaned webpages), to filter out all the files whose overlap with other files exceeds threshold, use this function:

lazynlp.filter_files(files, threshold=0.5, gran='word', n=8, capacity=100000000, error_rate=1e-7, header=0, interval=1000000)

Names of all the files that are deemed duplicated are stored in dupped_files.list

Names of all the files used for the dataset are stored in clean_files.list
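One way to picture this filtering is a single greedy pass, sketched below with an exact set in place of the Bloom filter. The function filter_texts is hypothetical, and the choice to record n-grams only from kept files is an assumption, not a statement about how lazynlp.filter_files() works internally.

```python
def filter_texts(named_texts, threshold=0.5, n=8):
    """Greedy near-duplicate filter over (name, text) pairs.

    A text is marked as a duplicate when more than threshold of its
    distinct word n-grams were already seen in previously kept texts.
    """
    seen = set()
    clean, dupped = [], []
    for name, text in named_texts:
        tokens = text.split()
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        overlap = len(grams & seen) / len(grams) if grams else 0.0
        if overlap > threshold:
            dupped.append(name)
        else:
            clean.append(name)
            seen.update(grams)
    return clean, dupped


pages = [
    ("0.txt", "one two three four"),
    ("1.txt", "one two three four"),    # exact copy of 0.txt
    ("2.txt", "five six seven eight"),  # unrelated text
]
clean_files, dupped_files = filter_texts(pages, threshold=0.5, n=2)
```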

Some notes:

  1. 1GB of text is about 1b characters. An English word has on average 4.5 characters, or 5.5 including whitespace. So 1GB of text is about 181M words.

  2. When I ran 30 scripts in parallel, it took 3 hours to download and clean 1GB of pure text. At that rate, 50GB of pure text takes about 150 hours, or roughly 6 days.

  3. The OpenAI dataset has 40GB, which I estimate to contain about 7-8 billion words. If you download all the webpages from the good Reddit URLs and Gutenberg books, you should have a dataset bigger than OpenAI's WebText.

  4. OpenAI, in their paper for GPT-2, didn't include Wikipedia articles for fear of overlapping. You can choose to include Wikipedia articles that have less than a certain amount of overlapping with the existing dataset using lazynlp.estimate_overlap_bf(bf, target_file, gran='word', n=8).
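The back-of-the-envelope numbers in these notes can be reproduced in a few lines. The inputs are the figures quoted above; treating 1GB as exactly one billion single-byte characters is a simplifying assumption.

```python
CHARS_PER_GB = 1_000_000_000  # assumes roughly one byte per character
CHARS_PER_WORD = 5.5          # 4.5 letters plus one whitespace character

# Note 1: words in 1GB of text.
words_per_gb = CHARS_PER_GB / CHARS_PER_WORD

# Note 2: time to download and clean 50GB at 3 hours per GB.
hours_for_50gb = 50 * 3
days_for_50gb = hours_for_50gb / 24

# Note 3: words in a 40GB dataset.
webtext_words = 40 * words_per_gb

print(f"{words_per_gb / 1e6:.0f}M words per GB")
print(f"{hours_for_50gb} hours = {days_for_50gb:.2f} days for 50GB")
print(f"{webtext_words / 1e9:.1f}B words in 40GB")
```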

Comments
  • License?

    Hello,

    There are legal problems with code with no license, where I work using code that has no license attached to it is outright banned.

    Would you be so kind to add some sort of license in a file?

    It would be very nice of you if it were something permissive, like MIT or Apache 2 or BSD too.

    Thank you!

    opened by mrkafk 2
  • syntax error near unexpected token

    I see a "syntax error near unexpected token `sgp.urls,'" on submitting the following command: lazynlp.download_pages(sgp.urls, text_docs, timeout = 30, default_skip = True, extensions = [], domains = [])

    Is there something wrong I am doing? sgp.urls has all the URLs, text_docs is the name of the folder to get the outputs into, the rest of the parameters as default.

    opened by vamsiuppala 2
  • Sum of n-gram counts

    Thanks for building this, really nice work!

    I was reading through the code and noticed this line https://github.com/chiphuyen/lazynlp/blob/08696976ff1b521103147e51a187e23504fe23bd/lazynlp/analytics.py#L56 Were you looking to iteratively add up the line-ngram-counts? If yes, I can help complete that and raise a PR

    Lmk

    All the best

    opened by MichaMucha 1
  • import re for line 18

    flake8 testing of https://github.com/chiphuyen/lazynlp on Python 3.7.1

    $ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

    ./lazynlp/utils.py:17:12: F821 undefined name 're'
        return re.match("^([a-z]\.)+?$", token.lower()) is not None
               ^
    1     F821 undefined name 're'
    1
    

    E901,E999,F821,F822,F823 are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. These 5 are different from most other flake8 issues which are merely "style violations" -- useful for readability but they do not affect runtime safety.

    • F821: undefined name name
    • F822: undefined name name in __all__
    • F823: local variable name referenced before assignment
    • E901: SyntaxError or IndentationError
    • E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree
    opened by cclauss 1
  • Check robot.txt and ai.txt

    Hello. I'm new to open source contribution. I saw your issue #6 and created a robots.py file that might help you.

    read_disallows(url) takes in a URL and returns a list of pattern objects containing all disallowed items from the robots.txt of the base URL for that URL.

    I've tested it by providing "https://github.com/GrayHat12" as input to the function. It extracted the base URL "https://github.com" and went on to read robots.txt using a GET request on "https://github.com/robots.txt". Then I used a regex to extract all disallowed URLs. Next I converted those URLs to regex strings that could be compared against any URL with the same base URL (github.com). For example, one disallowed URL is "/*/stargazers". I converted it to "/[^/]*/stargazers", compiled it to a pattern object, and added it to a disallowed list which is returned by the function.

    Now when you compare a URL "https://github.com/chiphuyen/lazynlp/stargazers" with the pattern "/[^/]*/stargazers", there will be a match found using re.match and you can choose to not crawl it.

    Hope this was explanatory enough. I didn't understand the ai.txt part in the issue though. Will be great if someone could elaborate on that. 🐰

    Sorry for any issues with my pull request. I'm new to this and am hoping someone will guide me through
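The wildcard-to-regex conversion described in this comment could look roughly like the sketch below. It is not the robots.py from the pull request: the function names are hypothetical, it matches against the URL path rather than the full URL, and it translates "*" to ".*" (any sequence of characters, following the robots.txt wildcard convention in RFC 9309) instead of "[^/]*".

```python
import re
from urllib.parse import urlparse


def compile_disallow(rule):
    """Compile one robots.txt Disallow value into a regex on the URL path."""
    anchored = rule.endswith("$")  # a trailing '$' anchors the end of the path
    if anchored:
        rule = rule[:-1]
    # Escape literal parts and let each '*' match any run of characters.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(url, rules):
    """Return True if the URL's path matches any compiled Disallow rule."""
    path = urlparse(url).path or "/"
    return any(rule.match(path) for rule in rules)


rules = [compile_disallow("/*/stargazers")]
blocked = is_disallowed("https://github.com/chiphuyen/lazynlp/stargazers", rules)
allowed = is_disallowed("https://github.com/chiphuyen/lazynlp", rules)
```

Matching on the path matters because re.match anchors at the start of the string, so a pattern that begins with "/" never matches a string that begins with "https://".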

    opened by GrayHat12 0
  • urllib fails without headers

    Hi, Thanks for this great tool.

    I noticed urllib fails with a Forbidden Request error when I call download_page on some links. You can reproduce the error by trying the code below:

    import lazynlp
    link = "https://punchng.com/"
    page = lazynlp.download_page(link, context=None, timeout=None)
    

    This raises a 403 (Forbidden) error.

    I've attempted to create a PR that adds headers to the request by default.
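For context, attaching default headers with urllib can be sketched as follows. The helper build_request and the User-Agent string are made up for illustration and are not taken from the pull request or from lazynlp. The request is only built here, not sent.

```python
import urllib.request

# Hypothetical default headers; many sites reject urllib's default User-Agent.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)",
}


def build_request(link, headers=None):
    """Build a urllib Request that carries default headers plus any overrides."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(link, headers=merged)


request = build_request("https://punchng.com/")
# urllib.request.urlopen(request, timeout=30) would then send these headers.
```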

    opened by Olamyy 0
  • Text quality score

    Have you considered adding a metric to assess the text quality of the documents, for example using the frequencies of short frequent words? (http://rolandschaefer.net/?p=78)

    opened by vanyacohen 1
  • (Also) parsing structured data while you're at it

    One might as well extract structured data from each element of such a dataset.

    Linked data. https://5stardata.info/

    Useful features:

    • Relations to e.g. https://schema.org/Dataset (s)
    • Reified edges to other https://schema.org/ScholarlyArticle (s) indicating whether A seems to confirm or disprove B
    • URIs for columns in CSV and CSVW datasets
      • https://www.w3.org/TR/tabular-data-primer/ (CSVW)
    help wanted 
    opened by westurner 1
Releases(v1.0.0)
Owner
Chip Huyen
Developing tools and best practices for machine learning production.