Web crawling framework based on asyncio.

Last update: Jan 05, 2023

Overview

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

Requirements

Python 3.5+

Installation

pip install gain

pip install uvloop (Linux only)
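Because uvloop is optional, code that runs on more than one platform usually has to cope with it being absent. The following is a minimal sketch of that idea, not a description of how gain itself wires in uvloop: `pick_loop_factory` is a hypothetical helper that falls back to the standard asyncio loop when uvloop cannot be imported.

```python
import asyncio


def pick_loop_factory():
    """Return uvloop's loop constructor if uvloop is installed, else asyncio's."""
    try:
        import uvloop  # optional dependency, may be missing on this platform
    except ImportError:
        return asyncio.new_event_loop
    return uvloop.new_event_loop


# Create a loop with whichever backend is available and run a trivial coroutine.
loop = pick_loop_factory()()
try:
    result = loop.run_until_complete(asyncio.sleep(0, result="ok"))
finally:
    loop.close()

print("coroutine returned:", result)
```

The same script then runs with or without uvloop installed; only the event loop implementation differs.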

Usage

Write spider.py:

import aiofiles
from gain import Css, Item, Parser, Spider


class Post(Item):
    # Fields extracted from each matched page with CSS selectors.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append the extracted title to a local file.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep the regex escapes (\d, \-) from being read as string escapes.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
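The two rules passed to Parser above look like regular expressions: one for paginated listing pages and one for dated post URLs (the rule that is paired with Post). As a rough illustration of what each pattern accepts, the sketch below checks them against a few URLs using only the standard re module. The sample URLs and the `classify` helper are made up for this example, and the sketch does not claim to reproduce how gain applies the rules internally.

```python
import re

# Rules copied from the parsers list above, written as raw strings.
PAGE_RULE = r'https://blog.scrapinghub.com/page/\d+/'
POST_RULE = r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/'

# Hypothetical URLs, invented for illustration only.
samples = [
    'https://blog.scrapinghub.com/page/2/',
    'https://blog.scrapinghub.com/2016/10/27/an-example-post/',
    'https://blog.scrapinghub.com/about/',
]


def classify(url):
    """Label a URL by the first rule that matches it in full."""
    if re.fullmatch(PAGE_RULE, url):
        return 'listing page'
    if re.fullmatch(POST_RULE, url):
        return 'post'
    return 'ignored'


labels = {url: classify(url) for url in samples}
for url, label in labels.items():
    print(label, url)
```

Checking a rule this way before starting a crawl is a cheap way to catch a pattern that is too strict or too loose.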

Or use XPathParser:

from gain import Css, Item, Spider, XPathParser


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    # Each XPath expression selects href attributes; the last rule is paired with Post.
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post),
    ]
    # Optional: send requests through a proxy.
    proxy = 'https://localhost:1234'


MySpider.run()

You can add a proxy setting to the spider, as shown above.
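Both spiders above set concurrency = 5. As a rough picture of what a concurrency limit means in asyncio code, the sketch below caps the number of simultaneous "requests" with asyncio.Semaphore. It is not gain's implementation: `fake_fetch` and `crawl` are hypothetical names, the example.com URLs are placeholders, no network access takes place, and asyncio.run assumes Python 3.7 or newer.

```python
import asyncio

CONCURRENCY = 5  # same value as the concurrency attribute in the spiders above


async def fake_fetch(url, limiter, stats):
    """Stand-in for an HTTP request; sleeps instead of touching the network."""
    async with limiter:  # waits here while CONCURRENCY fetches are already running
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        await asyncio.sleep(0.01)
        stats['active'] -= 1
    return url


async def crawl(urls):
    """Fetch every URL while tracking the highest number of fetches in flight."""
    limiter = asyncio.Semaphore(CONCURRENCY)
    stats = {'active': 0, 'peak': 0}
    fetched = await asyncio.gather(*(fake_fetch(u, limiter, stats) for u in urls))
    return fetched, stats['peak']


urls = ['https://example.com/page/%d/' % i for i in range(20)]
fetched, peak = asyncio.run(crawl(urls))
print(len(fetched), 'pages fetched, peak in flight:', peak)
```

All twenty tasks are scheduled at once, but the semaphore keeps the number of active fetches from exceeding the limit.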

Run the spider with: python spider.py
Example

The examples are in the /example/ directory.

Contribution

Open a pull request.
Open an issue.

Owner

Jiuli Gao
