Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

Overview

Parsel

Parsel is a BSD-licensed Python library to extract and remove data from HTML and XML using XPath and CSS selectors, optionally combined with regular expressions.

Find the Parsel online documentation at https://parsel.readthedocs.org.

Example:

>>> from parsel import Selector
>>> selector = Selector(text=u"""<html>
        <body>
            <h1>Hello, Parsel!</h1>
            <ul>
                <li><a href="http://example.com">Link 1</a></li>
                <li><a href="http://scrapy.org">Link 2</a></li>
            </ul>
        </body>
        </html>""")
>>> selector.css('h1::text').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').re(r'\w+')
['Hello', 'Parsel']
>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
http://example.com
http://scrapy.org
Comments
  • Add attributes dict accessor

    It's often useful to get the attributes of the underlying elements, and Parsel currently doesn't make it obvious how to do that.

    Currently, the first instinct is to use XPath, which makes it a bit awkward because you need a trick like the name(@*[i]) described in this blog post.

    This PR proposes adding two methods .attrs() and .attrs_all() (mirroring get and getall) for getting attributes in a way that more or less made sense to me. What do you think?
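The get/getall mirroring can be illustrated without Parsel itself. The sketch below uses the standard library's ElementTree as a stand-in for the selected elements; the function names `attrs` and `attrs_all` follow the proposal, but their signatures and behavior here are assumptions, not the PR's actual code.

```python
import xml.etree.ElementTree as ET


def attrs_all(elements):
    """Return one attribute dict per element (mirrors getall())."""
    return [dict(el.attrib) for el in elements]


def attrs(elements, default=None):
    """Return the attribute dict of the first element, or default (mirrors get())."""
    for el in elements:
        return dict(el.attrib)
    return default


# Stand-in document; in Parsel these would be the elements behind a SelectorList.
root = ET.fromstring(
    '<ul>'
    '<li><a href="http://example.com" class="ext">Link 1</a></li>'
    '<li><a href="http://scrapy.org">Link 2</a></li>'
    '</ul>'
)
links = root.findall(".//a")

first = attrs(links)      # attributes of the first <a>
every = attrs_all(links)  # attributes of every <a>
print(first)
print(every)
```

Returning plain dicts avoids the name(@*[i]) trick entirely, since attribute names and values come back together.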

    opened by eliasdorneles 18
  • [MRG+1] Add has-class xpath extension function

    This is a POC implementation of what's been discussed in #13

    The benchmark, extended to include this implementation of has-class, is available here. To summarize its results: throughout this morning I saw the custom XPath function implementations (except the old has-class) taking 150-200% of the css->xpath approach's time in the first test, while the last case, sel.xpath('//*[has-class("story")]//a'), consistently took only 60% of the time of sel.css('.story').xpath('.//a').

    But right now I'm seeing a different situation:

    $ python bench.py 
    sel.css(".story")                                                       0.654  1.000
    sel.xpath("//*[has-class-old('story')]")                               12.256 18.737
    sel.xpath("//*[has-class-set('story')]")                                1.907  2.915
    sel.xpath("//*[has-one-class('story')]")                                1.715  2.623
    sel.xpath("//*[has-class-plain('story')]")                              1.770  2.706
    
    
    sel.css("article.story")                                                0.201  1.000
    sel.xpath("//article[has-class-old('story')]")                          1.219  6.072
    sel.xpath("//article[has-class-set('story')]")                          0.314  1.566
    sel.xpath("//article[has-one-class('story')]")                          0.292  1.454
    sel.xpath("//article[has-class-plain('story')]")                        0.299  1.490
    
    
    sel.css("article.theme-summary.story")                                  0.192  1.000
    sel.xpath("//article[has-class-old('theme-summary', 'story')]")         1.288  6.699
    sel.xpath("//article[has-class-set('theme-summary', 'story')]")         0.266  1.384
    sel.xpath("//article[has-class-plain('theme-summary', 'story')]")       0.247  1.284
    
    
    sel.css(".story").xpath(".//a")                                         1.798  1.000
    sel.xpath("//*[has-class-old('story')]//a")                            11.995  6.671
    sel.xpath("//*[has-class-set('story')]//a")                             1.944  1.081
    sel.xpath("//*[has-one-class('story')]//a")                             1.747  0.972
    sel.xpath("//*[has-class-plain('story')]//a")                           1.778  0.989
    
    opened by immerrr 16
  • [MRG+1] Correct build/test failure with no module named 'tests'

    When working on the Debian package of parsel, I got the following error, due to the 'tests' module not being found. This commit corrects it.

    Traceback (most recent call last):
      File "setup.py", line 54, in <module>
        test_suite='tests',
      File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
        dist.run_commands()
      File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
        self.run_command(cmd)
      File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
        cmd_obj.run()
      File "/usr/lib/python2.7/dist-packages/setuptools/command/test.py", line 159, in run
        self.with_project_on_sys_path(self.run_tests)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/test.py", line 140, in with_project_on_sys_path
        func()
      File "/usr/lib/python2.7/dist-packages/setuptools/command/test.py", line 180, in run_tests
        testRunner=self._resolve_as_ep(self.test_runner),
      File "/usr/lib/python2.7/unittest/main.py", line 94, in __init__
        self.parseArgs(argv)
      File "/usr/lib/python2.7/unittest/main.py", line 149, in parseArgs
        self.createTests()
      File "/usr/lib/python2.7/unittest/main.py", line 158, in createTests
        self.module)
      File "/usr/lib/python2.7/unittest/loader.py", line 130, in loadTestsFromNames
        suites = [self.loadTestsFromName(name, module) for name in names]
      File "/usr/lib/python2.7/unittest/loader.py", line 91, in loadTestsFromName
        module = __import__('.'.join(parts_copy))
    ImportError: No module named tests
    E: pybuild pybuild:274: test: plugin distutils failed with: exit code=1: python2.7 setup.py test 
    dh_auto_test: pybuild --test -i python{version} -p 2.7 returned exit code 13
    

    Cheers

    opened by ghantoos 15
  • Added text_content() method to selectors.

    I recently needed to extract the text contents of HTML nodes as plain old strings, ignoring nested tags and extra spaces.

    While that wasn't hard, it is a common operation that should be built into Scrapy.

    opened by paulo-raca 13
  • updating classifiers in setup.py

    Hey fellows,

    This fixes the pre-alpha classifier (see #18) and also adds Topic :: Text Processing :: Markup.

    We can consider it stable since v1.0; Scrapy already depends on the current API.

    Also, maybe we could add the more specific Topic :: Text Processing :: Markup :: HTML and Topic :: Text Processing :: Markup :: XML too (see list of classifiers here).

    What do you think?

    Thanks!

    opened by eliasdorneles 13
  • Docstrings and autodocs for API reference

    Hey, fellows!

    Here is a change to use docstrings + autodocs for the API reference, making it easier to keep it in sync with the code. This fixes #16

    Does this look good?

    opened by eliasdorneles 12
  • [MRG+1] exception handling was hiding original exception

    Any xpath error was caught and reraised as a ValueError complaining about an Invalid XPath, quoting the original xpath for debugging purposes.

    First, "Invalid XPath" is misleading because the same exception is also raised for XPath evaluation errors. It also hides the original exception message, which makes XPath debugging harder. I made it quote the original exception message too, which can be "Unregistered function", "Invalid Expression", "Undefined namespace prefix", etc.

    Now, before merging: any exception can occur during XPath evaluation, because any Python function can be registered and called in an XPath expression. I doubt anyone wraps xpath method calls in user code in another try/except block for ValueError. Even if somebody does, I bet it's for logging a custom error message, which doesn't justify the current try/except block. That's why I'm leaning towards dropping the try/except altogether. However, I opened this PR instead because I doubt you'd accept dropping it.
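The idea of quoting the original message can be sketched as follows. This is a minimal illustration, not the PR's code: `evaluate` is a hypothetical stand-in for the XPath engine, the wording of the wrapped message is an assumption, and `raise ... from` is Python 3 syntax.

```python
def evaluate(query):
    """Hypothetical stand-in for the XPath engine; fails on an unknown function."""
    if "undefined-fn" in query:
        raise RuntimeError("Unregistered function")
    return []


def xpath(query):
    try:
        return evaluate(query)
    except Exception as exc:
        # Keep the engine's own message in the text and chain the original
        # exception, so neither is hidden from whoever debugs the query.
        raise ValueError(f"XPath error: {exc} in {query}") from exc


try:
    xpath("//p[undefined-fn()]")
except ValueError as err:
    message = str(err)
    cause = err.__cause__
    print(message)
```

Dropping the try/except altogether, as the description suggests, would simply let the engine's exception propagate unchanged.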

    opened by Digenis 11
  • SelectorList re_first and regex optimizations

    I implemented https://github.com/scrapy/parsel/issues/52 and also did some optimization on the regex functions. I added pytest-benchmark to the project and created benchmarking tests to cover the regex stuff.

    It probably needs to be cleaned up a bit before merge. Also I'm not sure if you want to include benchmarks in your project. In that case I can create a branch without the benchmarks.

    Running the benchmarks

    To run the benchmarks with tox:

    tox -e benchmark
    

    You can also run the benchmarks with py.test directly in order to e.g. compare results.

    py.test --benchmark-only --benchmark-max-time=5
    

    Speedup

    I ran my tests by comparing the following branches: https://github.com/Tethik/parsel/tree/re_benchmark_tests and https://github.com/Tethik/parsel/tree/re_opt_with_benchmarks (source of the pull request).

    git checkout re_benchmark_tests
    py.test --benchmark-only --benchmark-max-time=20 --benchmark-save=without
    

    Then compared with the opt branch.

    git checkout re_opt_with_benchmarks
    py.test --benchmark-only --benchmark-max-time=20 --benchmark-compare=<num_without>
    

    Sample benchmark results found in the following gist. "NOW" is the optimized version and "0014" is the current version with a naïve re_first implementation. https://gist.github.com/Tethik/8885c5c349c8922467b31a22078baf48

    Loosely interpreting the results, I get up to a 2.5x speedup on the re_first function, and a smaller improvement on the re function.
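Why a naive re_first is slower can be seen in a small sketch. The two functions below are illustrative assumptions written against plain strings with the standard `re` module, not Parsel's implementation: the naive version collects every match before taking the first, while the lazy version stops at the first hit.

```python
import re


def re_first_naive(pattern, texts, default=None):
    """Collect all matches from all texts, then return the first one."""
    matches = [m for text in texts for m in re.findall(pattern, text)]
    return matches[0] if matches else default


def re_first_lazy(pattern, texts, default=None):
    """Stop scanning as soon as any text yields a match."""
    regex = re.compile(pattern)
    for text in texts:
        match = regex.search(text)
        if match:
            return match.group()
    return default


texts = ["no digits here", "id 42 and 7", "then 99"]
naive = re_first_naive(r"\d+", texts)
lazy = re_first_lazy(r"\d+", texts)
print(naive, lazy)
```

Both return the same value for patterns without capture groups; the lazy version never touches the texts after the first match.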

    opened by Tethik 10
  • [MRG+1] Fix has-class to deal with newlines in class names

    The has-class() XPath function fails to select elements by class when there's a \n character right after the class name we're looking for. For example:

    >>> import parsel
    >>> html = '''<p class="foo
    bar">Hello</p>
    '''
    >>> parsel.Selector(text=html).xpath('//p[has-class("foo")]/text()').get() is None
    True
    

    Such (broken?) elements are not that uncommon around the web, and they break the expected behavior of has-class (at least from my point of view).
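One way such a bug can arise is a class test that treats only the space character as a separator. The sketch below contrasts that with splitting on any whitespace; both helpers are hypothetical illustrations and are not claimed to be the code has-class actually uses.

```python
def has_class_space_only(class_attr, name):
    """Pad with spaces and search: only ' ' counts as a separator."""
    return f" {name} " in f" {class_attr} "


def has_class_any_whitespace(class_attr, name):
    """str.split() with no argument splits on spaces, tabs and newlines."""
    return name in class_attr.split()


class_attr = "foo\nbar"  # the attribute value from the example above

space_only = has_class_space_only(class_attr, "foo")
any_whitespace = has_class_any_whitespace(class_attr, "foo")
print(space_only, any_whitespace)
```

With the newline-separated value, the space-only check misses "foo" while the whitespace-splitting check finds it.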

    Any thoughts on it?

    opened by stummjr 9
  • Caching css_to_xpath()'s recently used patterns to improve efficiency

    I profiled the scrapy-bench spider which uses response.css() for extracting information.

    The profiling results are here. The function css_to_xpath() takes 5% of the total time.

    When response.xpath() (profiling result) was used, the number of items extracted per second (benchmark result) was higher.

    Hence, I'm proposing caching for the recently used patterns, so that the function takes less time. I'm working on a prototype for the same and will add the results for it too.
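A cache of recently used patterns could look roughly like this. It is a sketch under assumptions: `translate` is a hypothetical stand-in for the real CSS-to-XPath translation (its output is not a valid translation), and the cache size of 256 is arbitrary.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive translation really runs


def translate(css):
    """Hypothetical stand-in for an expensive CSS-to-XPath translation."""
    global calls
    calls += 1
    return "descendant-or-self::" + css  # placeholder, not a real translation


@lru_cache(maxsize=256)
def cached_translate(css):
    """Return the cached translation when the same pattern was seen recently."""
    return translate(css)


# A spider typically applies the same few selectors to every response.
for _ in range(1000):
    cached_translate("h1")
cached_translate("ul")

print(calls, cached_translate.cache_info())
```

Because a spider reuses a small set of selector strings, nearly every call after the first becomes a dictionary lookup.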

    opened by Parth-Vader 9
  • Add .get() and .getall() aliases

    In quite a few projects, I've been using a .get() alias to .extract_first() as most of the time, getting the first match is what I want. To me, .extract_first() feels a bit long to write (I'm probably getting lazy with age...)

    For cases where I do need to loop on results, I added a .getall() alias for .extract() on the results of .xpath() and .css() calls.

    I know there's been quite some discussion already to have .extract_first() in the first place, but I'm submitting my preference again.

    opened by redapple 9
  • Improve typing in parsel.selector._ctgroup

    parsel.selector._ctgroup, used to switch between mode implementations, is an untyped dict of dicts. Since it's a private variable, it makes sense to change it into something cleaner.
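One possible shape for "something cleaner" is a mapping from mode name to a typed record. The sketch below is only an illustration: the field names and the string values are placeholders chosen for this example, not the actual contents of _ctgroup.

```python
from typing import Dict, NamedTuple


class ModeConfig(NamedTuple):
    """Per-mode settings; all field names here are hypothetical."""
    parser: str
    css_translator: str
    tostring_method: str


# Placeholder values; the real entries would hold parser and translator objects.
MODES: Dict[str, ModeConfig] = {
    "html": ModeConfig("HTMLParser", "HTMLTranslator", "html"),
    "xml": ModeConfig("XMLParser", "GenericTranslator", "xml"),
}


def config_for(mode: str) -> ModeConfig:
    """Look up a mode, failing with a clear error for unknown names."""
    try:
        return MODES[mode]
    except KeyError:
        raise ValueError(f"Unknown mode: {mode!r}") from None


html_config = config_for("html")
print(html_config.tostring_method)
```

With a record type, a type checker can verify attribute access such as `html_config.parser`, which string-keyed nested dicts do not allow.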

    enhancement 
    opened by wRAR 1
  • Modernize SelectorList-related code

    There were multiple issues identified with the code of SelectorList itself, Selector.selectorlist_cls and their typing. Ideally:

    • SelectorList should only be able to contain Selector objects
    • SelectorList subclasses made to work with Selector subclasses should only be able to contain those
    • Selector subclasses shouldn't need to set selectorlist_cls to a respective SelectorList subclass manually
    • all of this should be properly typed without need for casts and other overrides
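The typing goals in the list above could be approached with a generic list type parameterized by the selector class. The sketch below uses simplified stand-in classes (named after Parsel's, but not its code) and covers only static typing; nothing here enforces the element type at runtime.

```python
from typing import List, TypeVar


class Selector:
    """Simplified stand-in for a selector wrapping a piece of text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def get(self) -> str:
        return self.text


S = TypeVar("S", bound=Selector)


class SelectorList(List[S]):
    """A list that a type checker knows contains only S instances."""

    def getall(self) -> List[str]:
        return [sel.get() for sel in self]


class LinkSelector(Selector):
    """Hypothetical Selector subclass with an extra method."""

    def url(self) -> str:
        return self.text.strip()


links: SelectorList[LinkSelector] = SelectorList(
    [LinkSelector(" http://example.com "), LinkSelector("http://scrapy.org")]
)
urls = [link.url() for link in links]  # no cast needed: elements are LinkSelector
print(urls, links.getall())
```

Because the element type is a type parameter, a subclass-specific list needs no separate SelectorList subclass just to satisfy the type checker.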

    This may require changing Selector and/or SelectorList base classes, but I think we will need to keep the API compatibility? It's also non-trivial because the API for subclassing them doesn't seem to be documented, the only reference is SelectorTestCase.test_extending_selector() (the related code was also changed when adding typing, not sure if it changed the interface).

    enhancement 
    opened by wRAR 0
  • Adding a `strip` kwarg to `get()` and `getall()`

    Hi,

    Thank you very much for this excellent library ❤️

    I've been using Parsel for a while and I constantly find myself calling .strip() after .get() or .getall(). I think it would be very helpful if Parsel provided a built-in mechanism for that.

    I suggest adding a strip kwarg to get() and getall(). It would be a boolean value, and when it's true, Parsel would call strip() on every match.

    Example with get():

    # Before
    author = selector.css("[itemprop=author] [itemprop=name]::text").get()
    if author:
       author = author.strip()
    
    # After
    author = selector.css("[itemprop=author] [itemprop=name]::text").get(strip=True)
    

    Example with getall():

    # Before
    authors = [author.strip() for author in selector.css("[itemprop=author] [itemprop=name]::text").getall()]
    
    # After
    authors = selector.css("[itemprop=author] [itemprop=name]::text").getall(strip=True)
    

    Alternatively, we could change the ::text pseudo-element to support an argument, like ::text(strip=1). That would be extremely handy too and probably more flexible than my original suggestion, but also more difficult to implement.

    I know I could strip whitespace with re() and re_first(), but it's overkill and hides the intent.

    Best regards, Benoit

    enhancement 
    opened by bblanchon 0
  • Discussion on implementing selectolax support

    Here are some of the changes I thought of implementing

    High level changes -

    1. Selector class takes a new argument "parser" which indicates which parser backend to use (lxml or selectolax).
    2. Selectolax itself provides two backends, Lexbor and Modest; by default it uses the Modest backend. Should additional support for Lexbor be added? We could use Modest by default and have users pass an argument if they want to use Lexbor.
    3. If the "parser" argument is not provided, lxml will be used by default, since that preserves the current behavior and keeps backward compatibility. It also allows the test suite to be used without changes to all the existing methods.
    4. If the xpath method is called on a selector instantiated with selectolax as the parser, raise NotImplementedError.

    Low level changes -

    1. Add selectolax to the list of parsers in _ctgroup and modify create_root_node to instantiate the selected parser with the provided data.
    2. Modify the xpath and css methods behavior to use both selectolax and lxml or write separate methods or classes to handle them.
    3. Utilize HTMLParser class in Selectolax and its css method to apply the css expression specified and return the data collected.
    4. Create a SelectorList with Selector objects created with the type and parser specified.
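The high-level behavior described above (a "parser" argument defaulting to lxml, and xpath raising NotImplementedError under selectolax) can be sketched with a toy class. `BackendSelector` is a hypothetical name, and no real parsing happens here; the return value of `xpath` is a placeholder.

```python
class BackendSelector:
    """Toy selector that only models backend selection, not parsing."""

    SUPPORTED = ("lxml", "selectolax")

    def __init__(self, text, parser="lxml"):
        # Defaulting to lxml keeps existing callers working unchanged.
        if parser not in self.SUPPORTED:
            raise ValueError(f"Unknown parser: {parser!r}")
        self.text = text
        self.parser = parser

    def xpath(self, query):
        if self.parser == "selectolax":
            # This backend offers CSS selection only in this sketch.
            raise NotImplementedError("XPath is not available with selectolax")
        return f"lxml result for {query}"  # placeholder for a real query


default_selector = BackendSelector("<p>Hello</p>")
fast_selector = BackendSelector("<p>Hello</p>", parser="selectolax")

try:
    fast_selector.xpath("//p")
    xpath_supported = True
except NotImplementedError:
    xpath_supported = False

print(default_selector.parser, xpath_supported)
```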

    This is still a work in progress and I will make a lot of changes. Please suggest any changes that need to be made to the current list.

    opened by deepakdinesh1123 7
  • Support JSONPath

    Support for JSONPath has been added with the jsonpath-ng library. Most of the implementation is based on #181 which adds support for json using the JMESPath library.

    closes #204

    opened by deepakdinesh1123 5
Releases(v1.7.0)
  • v1.7.0(Nov 1, 2022)

    • Add PEP 561-style type information
    • Support for Python 2.7, 3.5 and 3.6 is removed
    • Support for Python 3.9-3.11 is added
    • Very large documents (with deep nesting or long tag content) can now be parsed, and Selector now takes a new argument huge_tree to disable this
    • Support for new features of cssselect 1.2.0 is added
    • The Selector.remove() and SelectorList.remove() methods are deprecated and replaced with the new Selector.drop() and SelectorList.drop() methods which don’t delete text after the dropped elements when used in the HTML mode.
  • v1.6.0(May 7, 2020)

    • Python 3.4 is no longer supported
    • New Selector.remove() and SelectorList.remove() methods to remove selected elements from the parsed document tree
    • Improvements to error reporting, test coverage and documentation, and code cleanup
  • v1.3.1(Dec 28, 2017)

    • has-class XPath extension function;
    • parsel.xpathfuncs.set_xpathfunc is a simplified way to register XPath extensions;
    • Selector.remove_namespaces now removes namespace declarations;
    • Python 3.3 support is dropped;
    • make htmlview command for easier Parsel docs development.
    • CI: PyPy installation is fixed; parsel now runs tests for PyPy3 as well.

    1.3.1 was released shortly after 1.3.0 to fix a PyPI upload issue.

  • v1.2.0(May 17, 2017)

    • Add get() and getall() methods as aliases for extract_first and extract respectively
    • Add default value parameter to SelectorList.re_first method
    • Add Selector.re_first method
    • Bug fix: detect None result from lxml parsing and fallback with an empty document
    • Rearrange XML/HTML examples in the selectors usage docs
    • Travis CI:
      • Test against Python 3.6
      • Test against PyPy using "Portable PyPy for Linux" distribution
  • v1.1.0(Nov 22, 2016)

    • Change default HTML parser to lxml.html.HTMLParser, which makes it easier to use some HTML-specific features
    • Add css2xpath function to translate CSS to XPath
    • Add support for ad-hoc namespace declarations
    • Add support for XPath variables
    • Documentation improvements and updates
  • v1.0.3(Jul 29, 2016)

    • Add BSD-3-Clause license file
    • Re-enable PyPy tests
    • Integrate py.test runs with setuptools (needed for Debian packaging)
    • Changelog is now called NEWS
  • v1.0.2(Apr 26, 2016)

  • v1.0.1(Sep 3, 2015)

  • v1.0.0(Sep 3, 2015)

  • v0.9.3(Aug 7, 2015)

  • v0.9.2(Aug 7, 2015)

  • v0.9.1(Aug 7, 2015)

  • v0.9.0(Jul 30, 2015)

    Released the first version of Parsel -- a library extracted from the Scrapy project that lets you extract text from XML/HTML documents using XPath or CSS selectors.

Owner
Scrapy project
An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.