A scrapy pipeline that provides an easy way to store files and images using various folder structures.

Last update: Oct 23, 2022

Overview

scrapy-folder-tree

This Scrapy pipeline stores downloaded files and images under one of several folder structures.

Supported folder structures:

Given this scraped file: 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg, you can choose the following folder structures:

Using file name

full
└── 0
    └── 5
        └── b
            └── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
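The file-name layout above can be reproduced with a few lines of standard-library Python. This is only a sketch of the idea, not the package's own code: the helper name `name_tree_path` and its `depth` and `root` parameters are hypothetical, and it assumes each nesting level is named after one leading character of the file name.

```python
from pathlib import PurePosixPath


def name_tree_path(filename: str, depth: int = 3, root: str = "full") -> str:
    """Build root/<c1>/<c2>/.../filename from the first `depth` characters.

    Hypothetical helper: assumes one folder level per leading character.
    """
    stem = PurePosixPath(filename).stem
    if depth > len(stem):
        raise ValueError("depth exceeds the length of the file name")
    # Unpacking the sliced string yields one path component per character.
    return str(PurePosixPath(root, *stem[:depth], filename))


print(name_tree_path("05b40af07cb3284506acbf395452e0e93bfc94c8.jpg"))
```

With `depth=3` the three folders are `0`, `5` and `b`, matching the tree shown above.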

Using crawling time

full
└── 0
    └── 11
        └── 48
            └── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
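A comparable sketch for the crawling-time layout follows. It assumes the three levels are the hour, minute and second of the crawl, written without zero padding; the source does not state this, it is inferred from the `0/11/48` example. The helper name `time_tree_path` is hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath


def time_tree_path(filename: str, crawled_at: datetime, root: str = "full") -> str:
    """Build root/<hour>/<minute>/<second>/filename (assumed layout, no padding)."""
    parts = (str(crawled_at.hour), str(crawled_at.minute), str(crawled_at.second))
    return str(PurePosixPath(root, *parts, filename))


# A crawl at 00:11:48 reproduces the folders 0, 11 and 48 from the example.
print(time_tree_path(
    "05b40af07cb3284506acbf395452e0e93bfc94c8.jpg",
    datetime(2022, 1, 24, 0, 11, 48),
))
```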

Using crawling date

full
└── 2022
    └── 1
        └── 24
            └── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
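The crawling-date layout can be sketched the same way, assuming the levels are year, month and day without zero padding (inferred from the `2022/1/24` example, not confirmed by the source). The helper name `date_tree_path` is hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def date_tree_path(filename: str, crawled_on: date, root: str = "full") -> str:
    """Build root/<year>/<month>/<day>/filename (assumed layout, no padding)."""
    parts = (str(crawled_on.year), str(crawled_on.month), str(crawled_on.day))
    return str(PurePosixPath(root, *parts, filename))


# A crawl on 24 January 2022 reproduces the folders 2022, 1 and 24.
print(date_tree_path(
    "05b40af07cb3284506acbf395452e0e93bfc94c8.jpg",
    date(2022, 1, 24),
))
```

Grouping by date keeps each day's downloads together, which may be convenient for incremental crawls.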

Installation

pip install scrapy_folder_tree

Usage

Use the following settings in your project:

ITEM_PIPELINES = {
    'scrapy_folder_tree.FilesHashTreePipeline': 300
}

FOLDER_TREE_DEPTH = 3
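For context, a slightly fuller settings sketch is shown below. Scrapy's stock FilesPipeline is only enabled when FILES_STORE is set and reads download URLs from the item's `file_urls` field; whether FilesHashTreePipeline builds on it is an assumption here, and the `downloads` directory is a placeholder.

```python
# settings.py (sketch)

ITEM_PIPELINES = {
    "scrapy_folder_tree.FilesHashTreePipeline": 300,
}

# Assumption: the pipeline builds on Scrapy's FilesPipeline, which needs a
# target directory. "downloads" is a placeholder path.
FILES_STORE = "downloads"

# Number of nested folder levels created under the store directory.
FOLDER_TREE_DEPTH = 3
```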

Owner

Panagiotis Simakis
