Scrapy Cluster

This Scrapy project uses Redis and Kafka to create a distributed, on-demand scraping cluster.

The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any other crawls those trigger, as a result of frontier expansion or depth traversal, will also be distributed among all workers in the cluster.
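
Because the coordination state lives in plain Redis data structures, you can inspect the shared request queues directly while the cluster runs. The sketch below uses redis-py; the "*:queue" key pattern and the sorted-set layout are assumptions about the demo configuration, not a documented API.

import redis

# Assumed demo settings: Redis on localhost:6379, request queues stored as
# sorted sets under keys matching "*:queue", ordered by request priority.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
for key in r.scan_iter(match="*:queue"):
    print(f"{key}: {r.zcard(key)} pending requests")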

The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.

Dependencies

Please see the requirements.txt within each sub-project for pip package dependencies.

Other important components required to run the cluster include Redis, Kafka, and Zookeeper (which Kafka relies on for coordination).

Core Concepts

This project brings a number of new concepts to Scrapy and to large-scale distributed crawling in general. Highlights include:

  • Dynamic, on-demand spiders that allow arbitrary collection of any web page submitted to the scraping cluster
  • Scale Scrapy instances across a single machine or multiple machines
  • Coordinate and prioritize their scraping effort for desired sites
  • Persist data across scraping jobs
  • Execute multiple scraping jobs concurrently
  • Gain in-depth access to information about your scraping jobs: what is upcoming and how the sites are ranked
  • Arbitrarily add, remove, or scale your scrapers in the pool without loss of data or downtime
  • Use Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results; see the producer sketch after this list)
  • Coordinate throttling of crawls from independent spiders on separate machines that sit behind the same IP address
  • Let completely different spiders yield crawl requests to one another, giving flexibility in how a crawl job is tackled
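
Because Kafka is the data bus, submitting a job is just producing a JSON message to the cluster's inbound topic. The sketch below uses the kafka-python package; the broker address (localhost:9092) and the topic name (demo.incoming) are assumptions based on the demo configuration, so adjust them to your deployment.

import json
from kafka import KafkaProducer  # pip install kafka-python

# Assumed demo settings: broker on localhost:9092, inbound topic demo.incoming.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Same JSON API as the curl example in the test environment steps below.
job = {"url": "http://dmoztools.net", "appid": "testapp", "crawlid": "abc123"}
producer.send("demo.incoming", job)
producer.flush()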

Scrapy Cluster test environment

To set up a pre-canned Scrapy Cluster test environment, make sure you have Docker and Docker Compose installed.

Steps to launch the test environment:

  1. Build your containers (or omit --build to pull from Docker Hub):
docker-compose up -d --build
  2. Tail Kafka to view your future results:
docker-compose exec kafka_monitor python kafkadump.py dump -t demo.crawled_firehose -ll INFO
  3. From another terminal, feed a request to Kafka:
curl localhost:5343/feed -H "content-type:application/json" -d '{"url": "http://dmoztools.net", "appid":"testapp", "crawlid":"abc123"}'
  4. Validate you've got data! After a couple of seconds, the terminal from step 2 should dump JSON like this:
{
  "body": "...content...",
  "crawlid": "abc123",
  "links": [],
  "encoding": "utf-8",
  "url": "http://dmoztools.net",
  "status_code": 200,
  "status_msg": "OK",
  "response_url": "http://dmoztools.net",
  "request_headers": {
    "Accept-Language": ["en"],
    "Accept-Encoding": ["gzip,deflate"],
    "Accept": ["text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"],
    "User-Agent": ["Scrapy/1.5.0 (+https://scrapy.org)"]
  },
  "response_headers": {
    "X-Amz-Cf-Pop": ["IAD79-C3"],
    "Via": ["1.1 82c27f654a5635aeb67d519456516244.cloudfront.net (CloudFront)"],
    "X-Cache": ["RefreshHit from cloudfront"],
    "Vary": ["Accept-Encoding"],
    "Server": ["AmazonS3"],
    "Last-Modified": ["Mon, 20 Mar 2017 16:43:41 GMT"],
    "Etag": ["\"cf6b76618b6f31cdec61181251aa39b7\""],
    "X-Amz-Cf-Id": ["y7MqDCLdBRu0UANgt4KOc6m3pKaCqsZP3U3ZgIuxMAJxoml2HTPs_Q=="],
    "Date": ["Tue, 22 Dec 2020 21:37:05 GMT"],
    "Content-Type": ["text/html"]
  },
  "timestamp": "2020-12-22T21:37:04.736926",
  "attrs": null,
  "appid": "testapp"
}
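
If you would rather consume results programmatically than tail them with kafkadump.py, a small Kafka consumer does the same job. This is a sketch assuming the demo defaults (broker reachable at localhost:9092, results on demo.crawled_firehose); adjust for your deployment.

import json
from kafka import KafkaConsumer  # pip install kafka-python

# Assumed demo settings: broker on localhost:9092, results topic
# demo.crawled_firehose carrying one JSON document per crawled page.
consumer = KafkaConsumer(
    "demo.crawled_firehose",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
for message in consumer:
    result = message.value
    print(result["crawlid"], result["status_code"], result["url"])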

Documentation

Please check out the official Scrapy Cluster documentation for more information on how everything works!

Branches

The master branch of this repository contains the latest stable release code for Scrapy Cluster 1.2.

The dev branch contains bleeding-edge code and is currently working toward Scrapy Cluster 1.3. Please note that not everything may be documented, finished, tested, or finalized, but we are happy to help guide those who are interested.
