Modern robots.txt Parser for Python

Related tags

Miscellaneousreppy
Overview

Robots Exclusion Protocol Parser for Python

Build Status

Robots.txt parsing in Python.

Goals

  • Fetching -- helper utilities for fetching and parsing robots.txts, including checking cache-control and expires headers
  • Support for newer features -- like Crawl-Delay and Sitemaps
  • Wildcard matching -- without using regexes, no less
  • Performance -- with >100k parses per second, >1M URL checks per second once parsed
  • Caching -- utilities to help with the caching of robots.txt responses

Installation

reppy is available on pypi:

pip install reppy

When installing from source, there are submodule dependencies that must also be fetched:

git submodule update --init --recursive
make install

Usage

Checking when pages are allowed

Two classes answer questions about whether a URL is allowed: Robots and Agent:

from reppy.robots import Robots

# This utility uses `requests` to fetch the content
robots = Robots.fetch('http://example.com/robots.txt')
robots.allowed('http://example.com/some/path/', 'my-user-agent')

# Get the rules for a specific agent
agent = robots.agent('my-user-agent')
agent.allowed('http://example.com/some/path/')

The Robots class also exposes properties expired and ttl to describe how long the response should be considered valid. A reppy.ttl policy is used to determine what that should be:

from reppy.ttl import HeaderWithDefaultPolicy

# Use the `cache-control` or `expires` headers, defaulting to a 30 minutes and
# ensuring it's at least 10 minutes
policy = HeaderWithDefaultPolicy(default=1800, minimum=600)

robots = Robots.fetch('http://example.com/robots.txt', ttl_policy=policy)

Customizing fetch

The fetch method accepts *args and **kwargs that are passed on to requests.get, allowing you to customize the way the fetch is executed:

robots = Robots.fetch('http://example.com/robots.txt', headers={...})

Matching Rules and Wildcards

Both * and $ are supported for wildcard matching.

This library follows the matching 1996 RFC describes. In the case where multiple rules match a query, the longest rules wins as it is presumed to be the most specific.

Checking sitemaps

The Robots class also lists the sitemaps that are listed in a robots.txt

# This property holds a list of URL strings of all the sitemaps listed
robots.sitemaps

Delay

The Crawl-Delay directive is per agent and can be accessed through that class. If none was specified, it's None:

# What's the delay my-user-agent should use
robots.agent('my-user-agent').delay

Determining the robots.txt URL

Given a URL, there's a utility to determine the URL of the corresponding robots.txt. It preserves the scheme and hostname and the port (if it's not the default port for the scheme).

# Get robots.txt URL for http://[email protected]:8080/path;params?query#fragment
# It's http://example.com:8080/robots.txt
Robots.robots_url('http://[email protected]:8080/path;params?query#fragment')

Caching

There are two cache classes provided -- RobotsCache, which caches entire reppy.Robots objects, and AgentCache, which only caches the reppy.Agent relevant to a client. These caches duck-type the class that they cache for the purposes of checking if a URL is allowed:

from reppy.cache import RobotsCache
cache = RobotsCache(capacity=100)
cache.allowed('http://example.com/foo/bar', 'my-user-agent')

from reppy.cache import AgentCache
cache = AgentCache(agent='my-user-agent', capacity=100)
cache.allowed('http://example.com/foo/bar')

Like reppy.Robots.fetch, the cache constructory accepts a ttl_policy to inform the expiration of the fetched Robots objects, as well as *args and **kwargs to be passed to reppy.Robots.fetch.

Caching Failures

There's a piece of classic caching advice: "don't cache failures." However, this is not always appropriate in certain circumstances. For example, if the failure is a timeout, clients may want to cache this result so that every check doesn't take a very long time.

To this end, the cache module provides a notion of a cache policy. It determines what to do in the case of an exception. The default is to cache a form of a disallowed response for 10 minutes, but you can configure it as you see fit:

# Do not cache failures (note the `ttl=0`):
from reppy.cache.policy import ReraiseExceptionPolicy
cache = AgentCache('my-user-agent', cache_policy=ReraiseExceptionPolicy(ttl=0))

# Cache and reraise failures for 10 minutes (note the `ttl=600`):
cache = AgentCache('my-user-agent', cache_policy=ReraiseExceptionPolicy(ttl=600))

# Treat failures as being disallowed
cache = AgentCache(
    'my-user-agent',
    cache_policy=DefaultObjectPolicy(ttl=600, lambda _: Agent().disallow('/')))

Development

A Vagrantfile is provided to bootstrap a development environment:

vagrant up

Alternatively, development can be conducted using a virtualenv:

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Tests

Tests may be run in vagrant:

make test

Development

Environment

To launch the vagrant image, we only need to vagrant up (though you may have to provide a --provider flag):

vagrant up

With a running vagrant instance, you can log in and run tests:

vagrant ssh
make test

Running Tests

Tests are run with the top-level Makefile:

make test

PRs

These are not all hard-and-fast rules, but in general PRs have the following expectations:

  • pass Travis -- or more generally, whatever CI is used for the particular project
  • be a complete unit -- whether a bug fix or feature, it should appear as a complete unit before consideration.
  • maintain code coverage -- some projects may include code coverage requirements as part of the build as well
  • maintain the established style -- this means the existing style of established projects, the established conventions of the team for a given language on new projects, and the guidelines of the community of the relevant languages and frameworks.
  • include failing tests -- in the case of bugs, failing tests demonstrating the bug should be included as one commit, followed by a commit making the test succeed. This allows us to jump to a world with a bug included, and prove that our test in fact exercises the bug.
  • be reviewed by one or more developers -- not all feedback has to be accepted, but it should all be considered.
  • avoid 'addressed PR feedback' commits -- in general, PR feedback should be rebased back into the appropriate commits that introduced the change. In cases, where this is burdensome, PR feedback commits may be used but should still describe the changed contained therein.

PR reviews consider the design, organization, and functionality of the submitted code.

Commits

Certain types of changes should be made in their own commits to improve readability. When too many different types of changes happen simultaneous to a single commit, the purpose of each change is muddled. By giving each commit a single logical purpose, it is implicitly clear why changes in that commit took place.

  • updating / upgrading dependencies -- this is especially true for invocations like bundle update or berks update.
  • introducing a new dependency -- often preceeded by a commit updating existing dependencies, this should only include the changes for the new dependency.
  • refactoring -- these commits should preserve all the existing functionality and merely update how it's done.
  • utility components to be used by a new feature -- if introducing an auxiliary class in support of a subsequent commit, add this new class (and its tests) in its own commit.
  • config changes -- when adjusting configuration in isolation
  • formatting / whitespace commits -- when adjusting code only for stylistic purposes.

New Features

Small new features (where small refers to the size and complexity of the change, not the impact) are often introduced in a single commit. Larger features or components might be built up piecewise, with each commit containing a single part of it (and its corresponding tests).

Bug Fixes

In general, bug fixes should come in two-commit pairs: a commit adding a failing test demonstrating the bug, and a commit making that failing test pass.

Tagging and Versioning

Whenever the version included in setup.py is changed (and it should be changed when appropriate using http://semver.org/), a corresponding tag should be created with the same version number (formatted v ).

git tag -a v0.1.0 -m 'Version 0.1.0

This release contains an initial working version of the `crawl` and `parse`
utilities.'

git push --tags origin
How to build an Fahrenheit to Celsius Converter in Python

Generally to measure the temperature we make use of one of these two popular units i.e. Fahrenheit & Celsius.

PyLaboratory 0 Feb 07, 2022
Find the remote website version based on a git repository

versionshaker Versionshaker is a tool to find a remote website version based on a git repository This tool will help you to find the website version o

Orange Cyberdefense 110 Oct 23, 2022
Python implementation of an automatic parallel parking system in a virtual environment, including path planning, path tracking, and parallel parking

Automatic Parallel Parking: Path Planning, Path Tracking & Control This repository contains a python implementation of an automatic parallel parking s

134 Jan 09, 2023
CDM Device Checker for python

CDM Device Checker for python

zackmark29 79 Dec 14, 2022
*考研学习利器,玩电脑控制不住自己时,可以使用该程序定日期锁屏,同时有精美壁纸锁屏显示,也不会枯燥。

LockscreenbyTime_win10 A python program in win10. You can set the time to lock the computer(by setting year, month, day), Fullscreen pictures will sho

PixianDouban 4 Jul 10, 2022
通过简单的卷积神经网络直接预测出验证码图片中滑块的位置

使用说明 1. 在本地测试 运行python3 prdict_one.py即可,默认需要预测的图片路径位于testImg文件夹下的test1.png 运行python3 predict_folder.py预测testImg下的所有图片 2. 部署到服务器 运行python3 run_a_server

12 Mar 08, 2022
Demo scripts for the Kubernetes Security Webinar

Kubernetes Security Webinar [in Russian] YouTube video (October 13, 2021) Authors: Artem Yushkovsky (LinkedIn, GitHub) Maxim Mosharov @ Whitespots.io

Slurm 34 Dec 06, 2022
How did Covid affect businesses?

NYC_Business_Analysis How did Covid affect businesses? COVID's effect on NYC businesses We all know that businesses in NYC have been affected by COVID

AK 1 Jan 15, 2022
Procscan is a quick and dirty python script used to look for potentially dangerous api call patterns in a Procmon PML file.

PROCSCAN Procscan is a quick and dirty python script used to look for potentially dangerous api call patterns in a Procmon PML file. Installation git

Daniel Santos 9 Sep 02, 2022
The parser of a timetable of tennis matches for Flashscore website

FlashscoreParser The parser of a timetable of tennis matches for Flashscore website. The program collects the schedule of tennis matches for two days

Valendovsky 1 Jul 15, 2022
Data and analysis relating to the 5.8M Melbourne quake of 2021

quake2021 Data and analysis relating to the 5.8M Melbourne quake of 2021 Monash University Woodside Living Lab Building The building is located here T

Colin Caprani 6 May 16, 2022
Improving Representations via Similarities

embetter warning I like to build in public, but please don't expect anything yet. This is alpha stuff! notes Improving Representations via Similaritie

vincent d warmerdam 229 Jan 08, 2023
A webapp for taking fast notes, designed for business, school, and collaboration with groups.

JOTS Journal of the Session A webapp for taking fast notes, designed for business, school, and collaboration with groups.

Zebadiah S. Taylor 2 Jun 10, 2022
Python solutions to Codeforces problems

CodeForces This repository is dedicated to my Python solutions for CodeForces problems. Feel free to copy, contribute and/or comment. If you find any

Shukur Sabzaliev 15 Dec 20, 2022
A collection of repositories used to realise various end-to-end high-level synthesis (HLS) flows centering around the CIRCT project.

circt-hls What is this?: A collection of repositories used to realise various end-to-end high-level synthesis (HLS) flows centering around the CIRCT p

29 Dec 14, 2022
A basic tic tac toe game on python!

A basic tic tac toe game on python!

Shubham Kumar Chandrabansi 1 Nov 18, 2021
A Lynx that manages a group that puts the federation first.

Lynx Super Federation Management Group Lynx was created to manage your groups on telegram and focuses on the Lynx Federation. I made this to root out

Unknown 2 Nov 01, 2022
A python script developed to process Windows memory images based on triage type.

Overview A python script developed to process Windows memory images based on triage type. Requirements Python3 Bulk Extractor Volatility2 with Communi

CrowdStrike 245 Nov 24, 2022
Structural basis for solubility in protein expression systems

Structural basis for solubility in protein expression systems Large-scale protein production for biotechnology and biopharmaceutical applications rely

ProteinQure 16 Aug 18, 2022
A simple bot that will help you in your learning and make it more fun.

hyperskill-SimpleChattyBot-python A simple bot that will help you in your learning and make it more fun. Syntax bot.py Stages Stage #1: Zuhura Bot we

1 Nov 09, 2021