Web Downloader With Python

Overview

Web Downloader

Introduction

This module provides an API to download the components of a web page: the html file, image files, css files, javascript files and files linked by href, based on the input url (the url must start with 'http' or 'https').

To process multiple URLs at the same time, list all the URLs you want to download in the file "urllist.txt" as shown below:

# Add the URL you want to download line by line(The url must start with 'http' or 'https' ):
# example: https://www.google.com
https://www.google.com
https://www.carousell.sg/
https://www.google.com/search?q=github&sxsrf=AOaemvJh3t5_h8H85AE8Ajbb1IMnBrRISA%3A1636698503535&source=hp&ei=hwmOYY6mHdGkqtsPq8S9sAY&iflsig=ALs-wAMAAAAAYY4Xl7GLWS16_xc2Q9XrG0p3q277DpkL&oq=&gs_lcp=Cgdnd3Mtd2l6EAEYADIHCCMQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJzINCC4QxwEQowIQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJzIHCCMQ6gIQJ1AAWABgjgdoAXAAeACAAQCIAQCSAQCYAQCwAQo&sclient=gws-wiz
https://stackoverflow.com/questions/66022042/how-to-let-kubernetes-pod-run-a-local-script/66025424
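
A minimal sketch of how a list in this format could be read. It assumes, as the comments in the file suggest, that lines starting with '#' are comments and that only lines starting with 'http' count as URLs. The function name `load_url_list` is hypothetical and is not taken from webDownload.py.

```python
def load_url_list(lines):
    """Return the URLs found in an iterable of text lines.

    Blank lines, '#' comment lines and lines that do not start with
    'http' are skipped (assumed behaviour, not taken from webDownload.py).
    """
    urls = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('#'):
            continue
        if line.startswith('http'):  # also matches 'https'
            urls.append(line)
    return urls


# Example input in the same shape as urllist.txt.
sample = [
    "# example: https://www.google.com",
    "",
    "https://www.google.com",
    "ftp://not-supported.example",
    "  https://www.carousell.sg/  ",
]
urls = load_url_list(sample)
```

To read the real file, the same function could be called with an open file handle, for example `load_url_list(open('urllist.txt', encoding='utf-8'))`.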

Program Setup

Development Environment : python 3.7.4
Additional Lib/Software Needed
  1. beautifulsoup4 4.10.0

    install:

    pip install beautifulsoup4
    

    Lib link: https://pypi.org/project/beautifulsoup4/

Hardware Needed : None
Program File List

version: v0.1

Program File   | Execution Env | Description
webDownload.py | python 3      | Main executable program that uses the API.
urllist.txt    |               | URL record list.

Program Usage

Module API Usage
  1. Downloader init:

    soup = urlDownloader(imgFlg=True, linkFlg=True, scriptFlg=True)

  • imgFlg: Set to "True" to download all the "<img>" tag files.
  • linkFlg: Set to "True" to download all the html sections, images, icons and css files imported by "<link>" tags.
  • scriptFlg: Set to "True" to download all the js files.
  2. Call the API method savePage to scrape the url and save the data in a folder:

    soup.savePage('<url>', '<folder name>')

    # Example:
    soup.savePage('https://www.google.com', 'www_google_com')
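
The three flags correspond to three kinds of tags in the page. The sketch below shows one way the resource URLs behind those tags could be collected. It is not the urlDownloader implementation: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs, and the names `ResourceCollector` and `collect_resources` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect absolute URLs of <img src>, <link href> and <script src>."""

    # tag name -> attribute that carries the resource URL
    TAG_ATTR = {'img': 'src', 'link': 'href', 'script': 'src'}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = {'img': [], 'link': [], 'script': []}

    def handle_starttag(self, tag, attrs):
        attr_name = self.TAG_ATTR.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            # Resolve relative paths against the page URL.
            self.found[tag].append(urljoin(self.base_url, value))


def collect_resources(html, base_url):
    """Parse html and return the resource URLs grouped by tag name."""
    parser = ResourceCollector(base_url)
    parser.feed(html)
    return parser.found


# Small made-up page used only for illustration.
page = '''<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="app.js"></script>
</head><body><img src="https://cdn.example.com/logo.png"></body></html>'''
resources = collect_resources(page, 'https://www.example.com/index.html')
```

Each list in `resources` could then be downloaded or skipped depending on the matching flag (imgFlg, linkFlg, scriptFlg).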
Program Execution
  1. Copy the URLs you want to check into the url record file "urllist.txt".

  2. cd to the program folder and run the program execution cmd:

    python webDownload.py
    
  3. Check the result:

    For example, if "https://www.carousell.sg/" is the first url listed in the file "urllist.txt", all the html files, image files and js files will be saved under the folder "1_www.carousell.sg_files":

    • The main web page will be saved as: "1_www.carousell.sg_files/1_www.carousell.sg.html"
    • The images used in the page will be saved in folder: "1_www.carousell.sg_files/img"
    • The html/image/css files imported by href will be saved in folder: "1_www.carousell.sg_files/link"
    • The js files used by the page will be saved in folder: "1_www.carousell.sg_files/script"
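
The folder layout above can be sketched as a small helper. This is an illustration only: it assumes the folder name is built from the position of the url in urllist.txt plus the host name, which matches the carousell example but is not confirmed for urls with longer paths. The function name `output_layout` is hypothetical.

```python
from urllib.parse import urlparse


def output_layout(index, url):
    """Return the assumed output paths for the index-th url in the list."""
    host = urlparse(url).netloc          # e.g. 'www.carousell.sg'
    base = f"{index}_{host}"             # e.g. '1_www.carousell.sg'
    root = f"{base}_files"
    return {
        'root': root,
        'page': f"{root}/{base}.html",   # main web page
        'img': f"{root}/img",            # images used in the page
        'link': f"{root}/link",          # files imported by href
        'script': f"{root}/script",      # js files
    }


layout = output_layout(1, 'https://www.carousell.sg/')
```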

Problem and Solution

Problem[0]: Downloaded files are slightly different

Why is there a slight difference between the files downloaded by the program and the files saved with a web browser's "page save as" for the same URL, such as www.google.com?

OS Platform : n.a

Error Message: n.a

Type: n.a

Solution:

This is normal: the logic of a web scraper and of a browser display are different. If different people type www.google.com into their browsers, the pages they see are also different. This is because the browser cache, the tokens in local storage and the cookies influence the "GET" request. So when different people open the Google URL in their browser, each sees their own Gmail icon in the top right corner. If you remove all the cache, local storage tokens and cookies from your browser and then try "page save as", the file saved by the browser should be the same as the file downloaded by the program.
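
The difference can be seen in the request itself. The sketch below builds two requests for the same URL with the standard library and sends neither of them: one without cookies, as a script would typically send it, and one with a Cookie header, as a logged-in browser might. The cookie and User-Agent values are made-up placeholders.

```python
from urllib.request import Request

url = 'https://www.google.com'

# What a script typically sends: no cookies, no stored session.
plain_request = Request(url)

# What a logged-in browser might send (placeholder values only).
browser_like_request = Request(url, headers={
    'Cookie': 'SESSION=placeholder-value',
    'User-Agent': 'Mozilla/5.0 (example)',
})

plain_has_cookie = plain_request.has_header('Cookie')
browser_cookie = browser_like_request.get_header('Cookie')
```

A server that personalizes its response based on the Cookie header can return different html for these two requests even though the URL is identical.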

Problem[1]: Some downloaded images are empty

OS Platform : n.a

Error Message: n.a

Type: n.a

Solution:

If a website uses third-party storage to host its images and that storage requires authorization before download, the program's download request will be rejected and the program gets 'null' instead of the file content. The saved image will then be empty.
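
One possible way to avoid writing empty image files is to check the response before saving it. The sketch below is an assumption about how such a check could look, not part of webDownload.py. The helper name `save_if_valid` is hypothetical, and the status code and body are passed in directly so the example runs without a network connection.

```python
import os
import tempfile


def save_if_valid(data, status, path):
    """Write data to path only for a 200 response with a non-empty body."""
    if status != 200 or not data:
        return False
    with open(path, 'wb') as fh:
        fh.write(data)
    return True


out_dir = tempfile.mkdtemp()
ok_path = os.path.join(out_dir, 'ok.png')
denied_path = os.path.join(out_dir, 'denied.png')

# A normal response is saved; a rejected (e.g. 403) empty response is skipped.
saved_ok = save_if_valid(b'\x89PNG-example-bytes', 200, ok_path)
saved_denied = save_if_valid(b'', 403, denied_path)
```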


Last edit by LiuYuancheng([email protected]) at 13/11/2021
