
Haul


Find thumbnails and original images from URL or HTML file.

Demo

Hauler on Heroku

Installation

on Ubuntu

$ sudo apt-get install build-essential python-dev libxml2-dev libxslt1-dev
$ pip install haul

on Mac OS X

$ pip install haul

Failed to install haul? The failure is most likely caused by lxml, which needs the build dependencies listed above.
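
If the lxml build is the problem, installing lxml on its own first makes the underlying error easier to see. A minimal example (on Ubuntu, the system packages listed above may still be required):

$ pip install lxml
$ pip install haul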

Usage

Find images from img src, a href and even background-image:

import haul

url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url)

print(result.image_urls)
"""
output:
[
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
]
"""

Find original (or larger) images with extend=True:

import haul

url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url, extend=True)

print(result.image_urls)
"""
output:
[
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
    # larger sizes, extended from the URLs above
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_1280.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_128.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_128.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_128.png',
]
"""

Advanced Usage

Custom finder / extender pipeline:

from haul import Haul
from haul.compat import str


def img_data_src_finder(pipeline_index,
                        soup,
                        finder_image_urls=[],
                        *args, **kwargs):
    """
    Find image URLs in <img>'s data-src attribute
    """

    now_finder_image_urls = []

    for img in soup.find_all('img'):
        src = img.get('data-src', None)
        if src:
            src = str(src)
            now_finder_image_urls.append(src)

    output = {}
    output['finder_image_urls'] = finder_image_urls + now_finder_image_urls

    return output

MY_FINDER_PIPELINE = (
    'haul.finders.pipeline.html.img_src_finder',
    'haul.finders.pipeline.css.background_image_finder',
    img_data_src_finder,
)

GOOGLE_SITES_EXTENDER_PIPELINE = (
    'haul.extenders.pipeline.google.blogspot_s1600_extender',
    'haul.extenders.pipeline.google.ggpht_s1600_extender',
    'haul.extenders.pipeline.google.googleusercontent_s1600_extender',
)

url = 'http://fashion-fever.nl/dressing-up/'
h = Haul(parser='lxml',
         finder_pipeline=MY_FINDER_PIPELINE,
         extender_pipeline=GOOGLE_SITES_EXTENDER_PIPELINE)
result = h.find_images(url, extend=True)
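
A custom extender can be written the same way. The sketch below mirrors the finder's calling convention; the exact keyword names (finder_image_urls, extender_image_urls) are assumptions based on the finder above and on the keys it returns, not a documented signature:

def tumblr_1280_extender(pipeline_index,
                         finder_image_urls=[],
                         extender_image_urls=[],
                         *args, **kwargs):
    """
    Hypothetical extender: rewrite tumblr *_500.png URLs to *_1280.png
    """

    now_extender_image_urls = []

    for image_url in finder_image_urls:
        if '.media.tumblr.com/' in image_url and image_url.endswith('_500.png'):
            now_extender_image_urls.append(image_url.replace('_500.png', '_1280.png'))

    output = {}
    output['extender_image_urls'] = extender_image_urls + now_extender_image_urls

    return output

MY_EXTENDER_PIPELINE = GOOGLE_SITES_EXTENDER_PIPELINE + (
    tumblr_1280_extender,
)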

Run Tests

$ python setup.py test