A spider for the Universal Online Judge (UOJ) system that converts problem pages to PDFs.

Universal Online Judge Spider

Introduction

This is a spider for the Universal Online Judge (UOJ) system (https://uoj.ac/).

It also works for all other Online Judges using the UOJ system.

This spider is written in Python 3 and uses the Selenium WebDriver library with ChromeDriver.

It has only been tested on Ubuntu 20.04, so the commands in the following sections are written for that system.

Features

  • Automatic login, no need to obtain cookies manually.
  • Converts pages into PDFs with selectable, copyable text rather than simple screenshots.
  • Automatically detects when MathJax has finished loading, so that mathematical formulas in the output are rendered correctly.
  • Automatically skips pages whose PDF file already exists locally.
  • Proxy support.
  • Support for all websites using the UOJ system.
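
One common way to get a text-based PDF out of Chrome is the DevTools command `Page.printToPDF`, which Selenium's Chrome driver exposes through `execute_cdp_cmd`. The sketch below illustrates that approach; it is an assumption, not necessarily how this spider does it, and the helper name `save_page_as_pdf` is hypothetical. Depending on the Chrome version, the command may only be available in headless mode.

```python
import base64
from pathlib import Path


def save_page_as_pdf(driver, pdf_path):
    """Print the page currently loaded in `driver` to `pdf_path`.

    `driver` is assumed to be a Selenium Chrome WebDriver (anything that
    provides execute_cdp_cmd). This helper is hypothetical, not project code.
    """
    # Page.printToPDF returns the document base64-encoded under the "data" key.
    result = driver.execute_cdp_cmd("Page.printToPDF", {"printBackground": True})
    pdf_bytes = base64.b64decode(result["data"])

    path = Path(pdf_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create save_dir if missing
    path.write_bytes(pdf_bytes)
    return path
```

Because the PDF comes from the browser's print pipeline rather than from a screenshot, text on the page stays selectable in the resulting file.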

Installation

1. Install Python 3 and ChromeDriver:

apt install python3 python3-pip chromium-browser chromium-chromedriver

2. Install the Selenium library for Python 3:

pip3 install selenium

3. Download this program

Usage

First, set these variables:

# [Basic settings]
url = ""
username = ""
password = ""
start_number = 1
end_number = 100
save_dir = "downloads"

# [Advanced settings]
proxy = ""
page_404_title = "404 - "
max_login_time = 60
max_mathjax_start_time = 60
max_mathjax_load_time = 60

Basic settings

  • url: the index URL of your target, e.g. https://uoj.ac/. Please note that the value must end in a slash /.
  • username: your username.
  • password: your password.
  • start_number: the number of the first problem to crawl (lower bound).
  • end_number: the number of the last problem to crawl (upper bound).
  • save_dir: the name of the folder where the result will be stored.
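
To illustrate how the basic settings fit together, here is a minimal sketch that turns them into a list of pages to fetch and skips problems whose PDF already exists. The `problem/<number>` URL pattern, the `<number>.pdf` file naming, and the function name `plan_downloads` are assumptions made for this example, not details taken from the program.

```python
from pathlib import Path


def plan_downloads(url, start_number, end_number, save_dir):
    """Yield (problem_url, pdf_path) for every problem that has no local PDF."""
    if not url.endswith("/"):
        raise ValueError("url must end in a slash, e.g. https://uoj.ac/")
    # Both bounds are inclusive, matching start_number and end_number.
    for number in range(start_number, end_number + 1):
        pdf_path = Path(save_dir) / f"{number}.pdf"  # assumed file naming
        if pdf_path.exists():
            continue  # already downloaded, skip this page
        # Assumed URL pattern: <url>problem/<number>
        yield f"{url}problem/{number}", pdf_path


if __name__ == "__main__":
    for problem_url, pdf_path in plan_downloads("https://uoj.ac/", 1, 3, "downloads"):
        print(problem_url, "->", pdf_path)
```

Requiring the trailing slash keeps the URL concatenation simple, which is why the setting insists on it.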

Advanced settings

If you don't know what the advanced settings are for, it is best to leave them unchanged.

  • proxy: the address of your proxy server, e.g. HTTP://127.0.0.1:1080, or SOCKS5://127.0.0.1:1081. Leave it blank (empty string) if you do not need to use a proxy.
  • page_404_title: the title of the OJ's 404 page. You may use a substring of the title, like 404 - . If the program gets a page whose title contains this string, that page is skipped.
  • max_login_time: the maximum waiting time for a login attempt, in seconds.
  • max_mathjax_start_time: the maximum wait time for a MathJax loading message to appear, in seconds.
  • max_mathjax_load_time: the maximum wait time for a MathJax loading message to disappear (i.e. MathJax rendering is finished), in seconds.
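
The advanced settings can be read as inputs to three small checks, sketched below under stated assumptions. All three helper names are hypothetical, and the code is not taken from the program. `--proxy-server` is a standard Chromium command-line switch; the proxy string is passed through unchanged.

```python
import time


def chrome_proxy_arguments(proxy):
    """Return the Chrome command-line arguments for `proxy` (none if blank)."""
    return [f"--proxy-server={proxy}"] if proxy else []


def is_404_title(title, page_404_title):
    """True if the page title contains the configured 404 marker."""
    return page_404_title in title


def wait_until(condition, timeout, poll_interval=0.5):
    """Poll `condition()` until it is truthy or `timeout` seconds have passed.

    Returns True if the condition was met in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

With a helper like `wait_until`, the two MathJax settings map to two consecutive waits: up to `max_mathjax_start_time` seconds for the loading message to appear, then up to `max_mathjax_load_time` seconds for it to disappear. `max_login_time` would bound a similar wait for the login to complete.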

After completing the setup, run:

python3 main.py

Sample result

page1

page2

License

MIT License.

Owner
TriNitroTofu