robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

Overview

RoboBrowser: Your friendly neighborhood web scraper

https://badge.fury.io/py/robobrowser.png https://travis-ci.org/jmcarp/robobrowser.png?branch=master https://coveralls.io/repos/jmcarp/robobrowser/badge.png?branch=master

Homepage: http://robobrowser.readthedocs.org/

RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser can fetch a page, click on links and buttons, and fill out and submit forms. If you need to interact with web services that don't have APIs, RoboBrowser can help.

import re
from robobrowser import RoboBrowser

# Browse to Genius
browser = RoboBrowser(history=True)
browser.open('http://genius.com/')

# Search for Porcupine Tree
form = browser.get_form(action='/search')
form                # <RoboForm q=>
form['q'].value = 'porcupine tree'
browser.submit_form(form)

# Look up the first song
songs = browser.select('.song_link')
browser.follow_link(songs[0])
lyrics = browser.select('.lyrics')
lyrics[0].text      # \nHear the sound of music ...

# Back to results page
browser.back()

# Look up my favorite song
song_link = browser.get_link('trains')
browser.follow_link(song_link)

# Can also search HTML using regex patterns
lyrics = browser.find(class_=re.compile(r'\blyrics\b'))
lyrics.text         # \nTrain set and match spied under the blind...

RoboBrowser combines the best of two excellent Python libraries: Requests and BeautifulSoup. RoboBrowser represents browser sessions using Requests and HTML responses using BeautifulSoup, transparently exposing methods of both libraries:

import re
from robobrowser import RoboBrowser

browser = RoboBrowser(user_agent='a python robot')
browser.open('https://github.com/')

# Inspect the browser session
browser.session.cookies['_gh_sess']         # BAh7Bzo...
browser.session.headers['User-Agent']       # a python robot

# Search the parsed HTML
browser.select('div.teaser-icon')       # [<div class="teaser-icon">
                                        # <span class="mega-octicon octicon-checklist"></span>
                                        # </div>,
                                        # ...
browser.find(class_=re.compile(r'column', re.I))    # <div class="one-third column">
                                                    # <div class="teaser-icon">
                                                    # <span class="mega-octicon octicon-checklist"></span>
                                                    # ...

You can also pass a custom Session instance for lower-level configuration:

from requests import Session
from robobrowser import RoboBrowser

session = Session()
session.verify = False  # Skip SSL verification
session.proxies = {'http': 'http://custom.proxy.com/'}  # Set default proxies
browser = RoboBrowser(session=session)

RoboBrowser also includes tools for working with forms, inspired by WebTest and Mechanize.

from robobrowser import RoboBrowser

browser = RoboBrowser()
browser.open('http://twitter.com')

# Get the signup form
signup_form = browser.get_form(class_='signup')
signup_form         # <RoboForm user[name]=, user[email]=, ...

# Inspect its values
signup_form['authenticity_token'].value     # 6d03597 ...

# Fill it out
signup_form['user[name]'].value = 'python-robot'
signup_form['user[user_password]'].value = 'secret'

# Submit the form
browser.submit_form(signup_form)

Checkboxes:

from robobrowser import RoboBrowser

# Browse to a page with checkbox inputs
browser = RoboBrowser()
browser.open('http://www.w3schools.com/html/html_forms.asp')

# Find the form
form = browser.get_forms()[3]
form                            # <RoboForm vehicle=[]>
form['vehicle']                 # <robobrowser.forms.fields.Checkbox...>

# Checked values can be get and set like lists
form['vehicle'].options         # [u'Bike', u'Car']
form['vehicle'].value           # []
form['vehicle'].value = ['Bike']
form['vehicle'].value = ['Bike', 'Car']

# Values can also be set using input labels
form['vehicle'].labels          # [u'I have a bike', u'I have a car \r\n']
form['vehicle'].value = ['I have a bike']
form['vehicle'].value           # [u'Bike']

# Only values that correspond to checkbox values or labels can be set;
# this will raise a `ValueError`
form['vehicle'].value = ['Hot Dogs']

Uploading files:

from robobrowser import RoboBrowser

# Browse to a page with an upload form
browser = RoboBrowser()
browser.open('http://cgi-lib.berkeley.edu/ex/fup.html')

# Find the form
upload_form = browser.get_form()
upload_form                     # <RoboForm upfile=, note=>

# Choose a file to upload
upload_form['upfile']           # <robobrowser.forms.fields.FileInput...>
upload_form['upfile'].value = open('path/to/file.txt', 'r')

# Submit
browser.submit(upload_form)

By default, creating a browser instantiates a new requests Session.

Requirements

  • Python >= 2.6 or >= 3.3

License

MIT licensed. See the bundled LICENSE file for more details.

Owner
Joshua Carp
Joshua Carp
A simple code to fetch comments below an Instagram post and save them to a csv file

fetch_comments A simple code to fetch comments below an Instagram post and save them to a csv file usage First you have to enter your username and pas

2 Jul 14, 2022
A simple flask application to scrape gogoanime website.

gogoanime-api-flask A simple flask application to scrape gogoanime website. Used for demo and learning purposes only. How to use the API The base api

1 Oct 29, 2021
Screen scraping and web crawling framework

Pomp Pomp is a screen scraping and web crawling framework. Pomp is inspired by and similar to Scrapy, but has a simpler implementation that lacks the

Evgeniy Tatarkin 61 Jun 21, 2021
A web scraper for nomadlist.com, made to avoid website restrictions.

Gypsylist gypsylist.py is a web scraper for nomadlist.com, made to avoid website restrictions. nomadlist.com is a website with a lot of information fo

Alessio Greggi 5 Nov 24, 2022
SkyScrapers: A collection of variety of Scraping Apps

SkyScrapers Collection of variety of Web Scraping Apps The web-scrapers involved

Biplov Pokhrel 3 Feb 17, 2022
淘宝、天猫半价抢购,抢电视、抢茅台,干死黄牛党

taobao_seckill 淘宝、天猫半价抢购,抢电视、抢茅台,干死黄牛党 依赖 安装chrome浏览器,根据浏览器的版本找到对应的chromedriver下载安装 web版使用说明 1、抢购前需要校准本地时间,然后把需要抢购的商品加入购物车 2、如果要打包成可执行文件,可使用pyinstalle

2k Jan 05, 2023
This is a module that I had created along with my friend. It's a basic web scraping module

QuickInfo PYPI link : https://pypi.org/project/quickinfo/ This is the library that you've all been searching for, it's built for developers and allows

OneBit 2 Dec 13, 2021
Python script for crawling ResearchGate.net papers✨⭐️📎

ResearchGate Crawler Python script for crawling ResearchGate.net papers About the script This code start crawling process by urls in start.txt and giv

Mohammad Sadegh Salimi 4 Aug 30, 2022
Facebook Group Scraping Using Beautiful Soup & Selenium

Extract Facebook group posts that are related to a specific topic and write them to a .json file.

Fatima Ghadieh 14 Aug 12, 2022
A package designed to scrape data from Yahoo Finance.

yahoostock A package designed to scrape data from Yahoo Finance. Installation The most simple installation method is through PIP. pip install yahoosto

Rohan Singh 2 May 28, 2022
A crawler of doubamovie

豆瓣电影 A crawler of doubamovie 一个小小的入门级scrapy框架的应用,选取豆瓣电影对排行榜前1000的电影数据进行爬取。 spider.py start_requests方法为scrapy的方法,我们对它进行重写。 def start_requests(self):

Cats without dried fish 1 Oct 05, 2021
Google Developer Profile Badge Scraper

Google Developer Profile Badge Scraper GDev Profile Badge Scraper is a Google Developer Profile Web Scraper which scrapes for specific badges in a use

Siddhant Lad 7 Jan 10, 2022
An helper library to scrape data from Instagram effortlessly, using the Influencer Hunters APIs.

Instagram Scraper An utility library to scrape data from Instagram hassle-free Go to the website » View Demo · Report Bug · Request Feature About The

2 Jul 06, 2022
A powerful annex BUBT, BUBT Soft, and BUBT website scraping script.

Annex Bubt Scraping Script I think this is the first public repository that provides free annex-BUBT, BUBT-Soft, and BUBT website scraping API script

Md Imam Hossain 4 Dec 03, 2022
A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

Alex Papadopoulos 1 Nov 13, 2021
Proxy scraper. Format: IP | PORT | COUNTRY | TYPE

proxy scraper 🔎 Installation: git clone https://github.com/ebankoff/proxy_scraper Required pip libraries (pip install library name): lxml beautifulso

Eban'ko 19 Dec 07, 2022
Dailyiptvlist.com Scraper With Python

Dailyiptvlist.com scraper Info Made in python Linux only script Script requires to have wget installed Running script Clone repository with: git clone

1 Oct 16, 2021
Telegram group scraper tool

Telegram Group Scrapper

Wahyusaputra 2 Jan 11, 2022
Crawler job that scrapes comments from social media posts and saves them in a S3 bucket.

Toxicity comments crawler Crawler job that scrapes comments from social media posts and saves them in a S3 bucket. Twitter Tweets and replies are scra

Douglas Trajano 2 Jan 24, 2022
Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers

Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers.

Louie Cai 13 Oct 15, 2022