A training task for web scraping using Python multithreading and a list of available proxy servers that is updated in real time.

Last update: Feb 10, 2022

Overview

Parallel web scraping

The project is a training task for web scraping using Python multithreading and a list of available proxy servers that is updated in real time.

Goal

The script extracts the names and prices of the top 100 crypto coins and stores the data in a database.

Disclaimer

The task is quite contrived and serves mainly for study purposes. There are numerous mature sources that provide both real-time and historical cryptocurrency data.

Solved problems within the project

Multiple pages with one level of nesting are scraped. Traversal is implemented by gathering internal links from the main page and then looping over them.
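The link-gathering step can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code; the `LinkCollector` class, the `internal_links` helper, and the sample markup and URLs are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(base_url, html):
    """Return absolute, de-duplicated links that stay on base_url's host."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical main-page markup, for illustration only.
MAIN_PAGE = """
<a href="/coins/alpha">Alpha</a>
<a href="/coins/beta">Beta</a>
<a href="https://other.example.org/ad">Ad</a>
<a href="/coins/alpha">Alpha again</a>
"""

links = internal_links("https://example.com/", MAIN_PAGE)
for url in links:
    print(url)  # each detail page would be fetched and parsed at this point
```

External links and repeated links are dropped, so the loop visits each nested page once.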
To avoid getting banned by the remote server, a mechanism for routing requests through proxy servers was implemented.
Free public proxy servers are commonly regarded as unreliable in terms of availability. To overcome this issue:
- a separate scraping script extracts a list of free public proxy servers from a website.
- on each launch of the script, the list of 10 proxy servers is refreshed with currently available ones.
- during script execution, some proxy servers become unavailable, so each scraping query walks through the list until it finds a live proxy server to execute the query.
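The "walk the list until a live proxy answers" idea might look roughly like the sketch below. It is not taken from the project: `fetch_via_first_alive`, `fake_fetch`, and the proxy addresses are hypothetical, and the real HTTP call is replaced by a stand-in so the example runs without network access.

```python
def fetch_via_first_alive(url, proxies, fetch):
    """Try each proxy in order; return (proxy, body) for the first that works.

    `fetch(url, proxy)` is a caller-supplied callable that returns the page
    body or raises an exception when the proxy is unusable.
    """
    last_error = None
    for proxy in proxies:
        try:
            return proxy, fetch(url, proxy)
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            last_error = exc
    raise RuntimeError(f"no alive proxy for {url}") from last_error


# Stand-in for a real HTTP request; these addresses are made up.
DEAD = {"10.0.0.1:8080", "10.0.0.2:8080"}


def fake_fetch(url, proxy):
    if proxy in DEAD:
        raise ConnectionError(f"{proxy} is unreachable")
    return f"<html>{url} via {proxy}</html>"


proxy_list = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
used, body = fetch_via_first_alive(
    "https://example.com/coins/alpha", proxy_list, fake_fetch
)
print(used)
```

If the `requests` library were used, `fetch` could wrap `requests.get(url, proxies={"http": ..., "https": ...}, timeout=...)`; a short timeout matters here, because a dead proxy otherwise stalls the whole query.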
To speed up scraping of the 101 web pages in total, multithreading is used. The work is divided among 4 threads running almost simultaneously.
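One way to split 101 pages across 4 threads is a thread pool, sketched below. The source does not say which threading API the project uses, so `ThreadPoolExecutor` is an assumption, and `scrape_page` is a placeholder for the real download-and-parse step.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_page(url):
    """Placeholder for the real per-page work (download and parse)."""
    return url, len(url)


def scrape_all(urls, workers=4):
    """Run scrape_page over urls on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even if pages finish out of order.
        return list(pool.map(scrape_page, urls))


# 1 main page + 100 detail pages = 101 URLs (hypothetical addresses).
urls = ["https://example.com/"] + [
    f"https://example.com/coins/{i}" for i in range(100)
]
results = scrape_all(urls)
print(len(results))
```

Threads suit this workload because it is I/O-bound: while one thread waits on the network, the others can make progress.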
The extracted data is written directly to a database.
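The source does not name the database engine, so the following sketch assumes SQLite purely for illustration. The table layout, the `open_db` and `save_coin` helpers, and the sample values are hypothetical. A lock serializes writes because several worker threads would share one connection.

```python
import sqlite3
import threading

db_lock = threading.Lock()


def open_db(path=":memory:"):
    # check_same_thread=False lets worker threads share this connection;
    # the lock in save_coin serializes the actual writes.
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS coins (name TEXT PRIMARY KEY, price REAL)"
    )
    return conn


def save_coin(conn, name, price):
    """Insert a coin, or replace its row if the name already exists."""
    with db_lock:
        conn.execute(
            "INSERT OR REPLACE INTO coins (name, price) VALUES (?, ?)",
            (name, price),
        )
        conn.commit()


conn = open_db()
save_coin(conn, "ExampleCoin", 1.25)  # made-up sample values
save_coin(conn, "ExampleCoin", 1.5)   # second write replaces the first
rows = conn.execute("SELECT name, price FROM coins").fetchall()
print(rows)
```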

Owner

Kushal Shingote
