Lightweight Python package for scraping and saving webpages and websites to local storage.


PyWebCopy © 5

Created By: Raja Tomar
License: MIT
Email: rajatomar788@gmail.com

Web scraping and saving of complete webpages and websites with Python.

A web scraping and archiving tool written in Python. Archive any online website and its assets (CSS, JS and images) for offline reading, storage or whatever reason. It's easy with pywebcopy.

Why is it great? Because it:

- respects robots.txt
- has single-function basic usage
- offers lots of configuration for many custom needs
- provides several scraping packages in one object (thanks to their original owners):
  - beautifulsoup4
  - lxml
  - requests
  - requests_html
  - pyquery

Email me at rajatomar788@gmail.com with any query :)

1.1 Installation

pywebcopy is available on PyPI and is easily installable using pip:

pip install pywebcopy

You are ready to go. Read the tutorials below to get started.

First steps

You should always check that pywebcopy was installed successfully.

>>> import pywebcopy
>>> pywebcopy.__version__
'5.x'

Your version may be different. Now you can continue with the tutorial.
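If you would rather check the installation from a script than from the console, a minimal sketch using only the standard library could look like the following. It assumes Python 3.8 or newer (for `importlib.metadata`), and the helper name `installed_version` is made up for this example; it is not part of pywebcopy.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("pywebcopy")
    if version is None:
        print("pywebcopy is not installed; run: pip install pywebcopy")
    else:
        print("pywebcopy", version, "is installed")
```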

1.2 Basic Usages

To save any single page, just type in python console

from pywebcopy import save_webpage


save_webpage(
    url='http://example-site.com/index.html',
    project_folder='path/to/downloads'
)

To save a full website (this could overload the target server, so be careful):

from pywebcopy import save_website

save_website(
    url='http://example-site.com/index.html',
    project_folder='path/to/downloads',
)

1.2.1 Running Tests

Running the tests is simple and doesn't require any external library. Just run this command from the root directory of the pywebcopy package:

$ python -m unittest pywebcopy.tests

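1.2.2 WebPage() object

from pywebcopy import WebPage

url = 'http://example-site.com/index.html'  # may also be None
project_folder = 'path/to/downloads/folder'

# placeholders: fill these with any config keys or
# `requests` keyword arguments you want to pass through
configKwargs = {}
requestsKwargs = {}

wp = WebPage(
    url,
    project_folder,
    default_encoding=None,
    HTML=None,
    **configKwargs
)

# You can choose to load the page explicitly using
# the `requests` module
wp.get(url, **requestsKwargs)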

# if you want assets only
wp.save_assets()

# if you want html only
wp.save_html()

# if you want complete webpage
wp.save_complete()

BeautifulSoup methods are supported

You can also use any BeautifulSoup method on it.

>>> links = wp.bs4.find_all('a')
>>> [a.get('href') for a in links]
['//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/download/other/']
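Several of the links in the output above are relative (`/about/apps/`) or protocol-relative (`//docs.python.org/3/tutorial/`), so they usually need to be resolved against the page URL before they can be fetched. The sketch below shows one way to do that with the standard library. The base URL `https://www.python.org/` and the helper name `absolutize` are assumptions for illustration only; this is not a description of how pywebcopy resolves links internally.

```python
from urllib.parse import urljoin

# Assumed base URL, for illustration only: the page the links were scraped from.
BASE_URL = "https://www.python.org/"

hrefs = [
    "//docs.python.org/3/tutorial/",
    "/about/apps/",
    "https://github.com/python/pythondotorg/issues",
]


def absolutize(base_url, links):
    """Resolve each link against base_url; absolute URLs pass through unchanged."""
    return [urljoin(base_url, link) for link in links]


absolute = absolutize(BASE_URL, hrefs)
for link in absolute:
    print(link)
```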

LXML is completely supported

You can use any lxml methods on it. Read more about lxml at http://lxml.de/

>>> wp.lxml.xpath('//a', ..)
[<Element 'a'>,<Element 'a'>]

PyQuery is fully supported

You can use PyQuery methods on it. Read more about pyquery at https://pythonhosted.org/pyquery/

>>> wp.pq.select(selector, ..)
...

XPath is also supported

XPath is also natively supported and returns requests_html.Element objects. See more at https://html.python-requests.org

>>> wp.xpath('a')
[<Element 'a' class='btn' href='https://help.github.com/articles/supported-browsers'>]

You can also select only elements containing certain text

>>> wp.find('a', containing='kenneth')
[<Element 'a' href='http://kennethreitz.com/pages/open-projects.html'>, <Element 'a'

Tutorials: sample use-cases with pywebcopy

Common Settings and Errors

`pywebcopy.exceptions.AccessError`

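If you get a pywebcopy.exceptions.AccessError exception, first check whether the website allows scraping of its content. If you still want to proceed, you can bypass the robots.txt restrictions through the bypass_robots config key: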

>>> import pywebcopy
>>> pywebcopy.config['bypass_robots'] = True

# rest of your code follows..
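Before bypassing anything, it can help to see what a site's robots.txt actually permits. The following is a minimal sketch using the standard library's `urllib.robotparser`, with a hypothetical robots.txt inlined so that it runs without network access. The helper name `is_allowed` and the sample rules are assumptions for this example, and this is not necessarily how pywebcopy performs its own check.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

public_ok = is_allowed(SAMPLE_ROBOTS, "PyWebcopyBot", "http://example-site.com/index.html")
private_ok = is_allowed(SAMPLE_ROBOTS, "PyWebcopyBot", "http://example-site.com/private/a.html")
print(public_ok, private_ok)
```

For a live site you would feed the function the text downloaded from the site's `/robots.txt` instead of the sample string.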

Overwrite existing files when copying

If you want to overwrite existing files in the directory then use the over_write config key.

>>> import pywebcopy
>>> pywebcopy.config['over_write'] = True

# rest of your code follows..
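To make the effect of such a flag concrete, here is a small standalone sketch of "skip existing files unless overwriting is enabled". The function `write_file` and its return value are hypothetical and only illustrate the idea; pywebcopy's own file handling may differ.

```python
import os
import tempfile


def write_file(path, data, over_write=False):
    """Write data to path; skip existing files unless over_write is True.

    Returns True if the file was written, False if it was skipped.
    """
    if os.path.exists(path) and not over_write:
        return False
    with open(path, "wb") as fh:
        fh.write(data)
    return True


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "index.html")
    first = write_file(target, b"<p>old</p>")                    # new file: written
    second = write_file(target, b"<p>new</p>")                   # exists: skipped
    third = write_file(target, b"<p>new</p>", over_write=True)   # exists: replaced
    with open(target, "rb") as fh:
        final_content = fh.read()
```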

Changing your project name

By default, pywebcopy creates a directory inside project_folder named after the URL you provided, but you can change this using the code below.

>>> import pywebcopy
>>> pywebcopy.config['project_name'] = 'my_project'

# rest of your code follows..
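The `config.setup_config` example in section 1.3.2 shows `project_name` resolving to `'example-site.com'` for the URL `http://example-site.com/index.html`. A rough sketch of how such a default could be derived from a URL is shown below. The helper `default_project_name` is hypothetical, and using only the host name (without any port) is an assumption of this example rather than documented pywebcopy behaviour.

```python
from urllib.parse import urlparse


def default_project_name(url):
    """Derive a folder-friendly name from a URL by taking its host name."""
    # .hostname drops any port, so the result has no ':' in it.
    return urlparse(url).hostname


print(default_project_name("http://example-site.com/index.html"))
print(default_project_name("http://localhost:5000/"))
```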

How to - Save Single Webpage

A particular webpage can be saved easily using the following methods.

Note: if you get pywebcopy.exceptions.AccessError when running any of this code, use the code provided in the pywebcopy.exceptions.AccessError section under Common Settings and Errors.

Method 1

A webpage can easily be saved using a built-in function called save_webpage(), which also takes several arguments.

>>> import pywebcopy
>>> pywebcopy.save_webpage(project_url='http://google.com', project_folder='c://Saved_Webpages/',)

# rest of your code follows..

Method 2

This use case is slightly more powerful as it provides every functionality of the WebPage class.

>>> from pywebcopy import WebPage

>>> wp = WebPage('http://google.com', 'e://tests/', project_name='Google')
>>> wp.save_complete()

# This WebPage object has every method of the WebPage() class and thus
# can be reused later.

Method 2 using Plain HTML

:New in version 4.x:

As I said earlier, the WebPage object is powerful and can be manipulated in many ways.

One feature is that raw HTML is now also accepted.

>>> from pywebcopy import WebPage

>>> HTML = open('test.html').read()

>>> base_url = 'http://example.com' # used as a base for downloading imgs, css, js files.
>>> project_folder = '/saved_pages/'

>>> wp = WebPage(base_url, project_folder, HTML=HTML)
>>> wp.save_complete()
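The base URL matters because asset references in raw HTML are often relative. As a rough illustration of that step, the sketch below collects image, stylesheet and script URLs from an HTML string and resolves them against a base URL using only the standard library. The class `AssetCollector` and the sample markup are made up for this example; pywebcopy itself relies on the parsers listed earlier, not on this code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, scripts and <link> targets in an HTML string."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.assets.append(urljoin(self.base_url, attrs["href"]))


# Hypothetical page content standing in for the contents of test.html.
html = (
    '<link rel="stylesheet" href="/css/site.css">'
    '<img src="img/logo.png">'
    '<script src="app.js"></script>'
)

collector = AssetCollector("http://example.com")
collector.feed(html)
print(collector.assets)
```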

How to - Whole Websites

Use caution when copying websites, as this can overload or damage the site's servers and, in rare cases, could be illegal. Check everything before you proceed.

Method 1 -

Using the built-in API save_website(), which takes several arguments.

>>> import pywebcopy

>>> pywebcopy.save_website(project_url='http://localhost:8000', project_folder='e://tests/')

Method 2 -

By creating a Crawler() object which provides several other functions as well.

>>> import pywebcopy

>>> pywebcopy.config.setup_config(project_url='http://localhost:5000/', project_folder='e://tests/', project_name='LocalHost')

>>> crawler = pywebcopy.Crawler('http://localhost:5000/')
>>> crawler.crawl()
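For readers unfamiliar with what crawling a site involves, here is a conceptual sketch: a breadth-first walk that follows links, stays on the starting host and never visits a page twice. It runs against a hypothetical in-memory site (`FAKE_SITE`) instead of the network, and the function `crawl` is this example's own; it is not the implementation of pywebcopy's Crawler.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical site: maps each page URL to the hrefs found on that page.
FAKE_SITE = {
    "http://localhost:5000/": ["/about", "/blog/", "http://other-site.com/"],
    "http://localhost:5000/about": ["/"],
    "http://localhost:5000/blog/": ["post-1", "/about"],
    "http://localhost:5000/blog/post-1": [],
}


def crawl(start_url, get_links):
    """Visit every same-host page reachable from start_url, breadth first."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for href in get_links(page):
            link = urljoin(page, href)
            # Stay on the starting host and never revisit a page.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("http://localhost:5000/", lambda page: FAKE_SITE.get(page, []))
for page in visited:
    print(page)
```

A real crawl would also add a delay between requests to avoid overloading the server.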

Contribution

You can contribute in many ways:

- reporting bugs on the GitHub repo https://github.com/rajatomar788/pywebcopy/ or by email
- creating pull requests on the GitHub repo https://github.com/rajatomar788/pywebcopy/
- sending a thank-you mail

If you have any suggestions, fixes or reports, feel free to mail me :)

1.3 Configuration

pywebcopy is highly configurable.

1.3.1 Direct Call Method

To change any configuration, just pass it to the function call.

Example:

from pywebcopy.core import save_webpage

save_webpage(

    url='http://some-site.com/', # required
    download_loc='path/to/downloads/', # required

    # config keys are case-insensitive
    any_config_key='new_value',
    another_config_key='another_new_value',

    # ...

    # add as many as you want :)
)
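The comment above says config keys are case-insensitive. One simple way such behaviour can be achieved is to lower-case every key before merging, as in the sketch below. The function `merge_config`, the small `DEFAULTS` subset and the choice to raise `KeyError` on unknown keys are all assumptions of this example, not pywebcopy's actual code.

```python
# A small subset of the documented config keys, for illustration only.
DEFAULTS = {"over_write": False, "bypass_robots": False, "load_css": True}


def merge_config(defaults, **overrides):
    """Return a copy of defaults updated with overrides, matching keys case-insensitively."""
    merged = {key.lower(): value for key, value in defaults.items()}
    for key, value in overrides.items():
        key = key.lower()
        if key not in merged:
            raise KeyError("unknown config key: " + key)
        merged[key] = value
    return merged


config = merge_config(DEFAULTS, OVER_WRITE=True, Load_CSS=False)
print(config)
```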

1.3.2 `config.setup_config` Method

This function was previously core.setup_config.

You can manually set every configuration by using a config.setup_config call.

import pywebcopy

url = 'http://example-site.com/index.html'
download_loc = 'path/to/downloads/'

pywebcopy.config.setup_config(url, download_loc)

# done!

>>> pywebcopy.config.config['url']
'http://example-site.com/index.html'

>>> pywebcopy.config.config['mirrors_dir']
'path/to/downloads'

>>> pywebcopy.config.config['project_name']
'example-site.com'


## You can also change any of these by just adding param to
## `setup_config` call

>>> pywebcopy.config.setup_config(url, 
        download_loc,project_name='Your-Project', ...)

## You can also change any config even after
## the `setup_config` call

pywebcopy.config.config['url'] = 'http://url-changed.com'
# rest of config remains unchanged

Done!

1.3.3 List of available `configurations`

Below is the list of config keys with their default values:

# writes the trace output and log file content to console directly
'DEBUG': False  

# make zip archive of the downloaded content
'zip_project_folder': True

# delete the project folder after making zip archive of it
'delete_project_folder': False

# which parser to use when parsing pages
# for speed choose 'html.parser' (may break some webpages)
# for exact webpage copy choose 'html5lib' (a little slow)
# or you can leave it to default 'lxml' (balanced)
'PARSER' : 'lxml'

# to download css file or not
'LOAD_CSS': True

# to download images or not
'LOAD_IMAGES': True

# to download js file or not
'LOAD_JAVASCRIPT': True


# to overwrite the existing files if found
'OVER_WRITE': False

# list of allowed file extensions
'ALLOWED_FILE_EXT': ['.html', '.css', '.json', '.js',
                     '.xml','.svg', '.gif', '.ico',
                      '.jpeg', '.jpg', '.png', '.ttf',
                      '.eot', '.otf', '.woff']

# log file path
'LOG_FILE': None

# name of the mirror project
'PROJECT_NAME': 'website-name.com'

# define the base directory to store all copied sites data
'PROJECT_FOLDER': None


# DANGER ZONE
# CHANGE THESE ON YOUR RESPONSIBILITY
# NOTE: Do not change unless you know what you're doing

# requests headers to be shown on requests made to server
'http_headers': {
    "Accept-Language": "en-US,en;q=0.9",
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64; PyWebcopyBot/{};) AppleWebKit/604.1.38 (KHTML, like Gecko) Chrome/68.0.3325.162".format(VERSION)
}

# bypass the robots.txt restrictions
'BYPASS_ROBOTS' : False

told you there were plenty of config vars available!
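As an illustration of how a setting like ALLOWED_FILE_EXT might be applied, the sketch below checks whether a URL's path ends in one of the listed extensions, ignoring any query string. The helper `has_allowed_extension` is hypothetical, and the case-insensitive comparison is an assumption of this example.

```python
import posixpath
from urllib.parse import urlparse

# Default list copied from the configuration table above.
ALLOWED_FILE_EXT = ['.html', '.css', '.json', '.js',
                    '.xml', '.svg', '.gif', '.ico',
                    '.jpeg', '.jpg', '.png', '.ttf',
                    '.eot', '.otf', '.woff']


def has_allowed_extension(url, allowed=ALLOWED_FILE_EXT):
    """Return True if the URL path ends in one of the allowed extensions."""
    path = urlparse(url).path                    # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lower()    # '' when the path has no extension
    return ext in allowed


print(has_allowed_extension("http://example-site.com/static/style.css?v=3"))
print(has_allowed_extension("http://example-site.com/downloads/setup.exe"))
```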

1.5 Undocumented Features

I built many utils and classes in this project to ease the tasks I was trying to do.

But these tasks are also suitable for general-purpose use.

So if you want to help write suitable documentation for these undocumented features, you can always email me.

1.6 Changelog

[version 5.x]

Optimization of existing code, up to 5x speed-ups in certain cases
Removed cluttering, improved readability

[version 4.x]

A complete rewrite and restructuring of core functionality.

[version 2.0.0]

[changed]

core.setup_config function is changed to config.setup_config.

[added]

added utils.trace decorator, which will print function_name, args, kwargs and return value when debug config key is True.
new html-parsers ('html5lib', 'lxml') are supported for better webpages.
html-parser is now defaulted to 'lxml'. You can use any through new config.config key called parser

[fixed]

fixed an issue where changing the user-agent key broke webpages. You can now use any browser's user-agent ID and the exact same page will be downloaded.
fixed issue in generators.extract_css_urls which was caused by str and bytes difference in python3.
fixed issues in module importing. (Thanks "Илья Игоревич").
added error handling to required functions

[version 2.0(beta)]

init function is replaced with save_webpage
three new config automation functions are added -
- core.setup_config (creates every ideal config just from url and download location)
- config.reset_config (resets the configuration to default state)
- config.update_config (manual-mode version of core.setup_config)
object structures.WebPage added
merged generators.generate_style_map and generators.generate_relative_paths to a single function generators.generate_style_map
rewrite of majority of functions
new module exceptions added

[version 1.10]

url is checked and resolved of any redirection before starting any work functions.
init vars : mirrors_dir and clean_up were fixed which cleaned the dir before the log was completely written.
init call now takes url arg by default and could raise an error when not supplied
professional looking log entries
rewritten archiving system now uses zipfile and exceptions handling to prevent errors and eventual archive corruption

[version 1.9]

more redundant code
modules are now separated based on type e.g. Core, Generators, Utils etc.
new helper functions and class structures.WebPage
Compatible with Python 2.6, 2.7, 3.6, 3.7

Release history

7.1: May 13, 2025
7.0.2: Apr 27, 2022
7.0.1 (yanked): Oct 31, 2021. Reason this release was yanked: AttributeError: WebPage.html_mime_types 'tuple' object attribute '__doc__' is read-only
7.0.0 (yanked): Oct 31, 2021. Reason this release was yanked: bugged
6.3.0: Apr 5, 2020
6.2.0: Mar 12, 2020
6.1.1: Dec 8, 2019
6.1.0: Dec 6, 2019
6.0.0: Jun 4, 2019
5.0.1 (this version): Jan 6, 2019
4.0.1: Oct 31, 2018
4.0.0: Sep 26, 2018
4.0.0rc0 (pre-release): Sep 26, 2018
2.0.3: Aug 19, 2018
2.0.1: Aug 18, 2018
2.0.0b0 (pre-release): Aug 11, 2018
1.10: Aug 4, 2018
1.9: Jul 23, 2018

Download files


Source Distribution

pywebcopy-5.0.1.tar.gz (34.6 kB view details)

Uploaded Jan 6, 2019 Source

Built Distribution

pywebcopy-5.0.1-py2.py3-none-any.whl (33.4 kB view details)

Uploaded Jan 6, 2019 Python 2Python 3

File details

Details for the file pywebcopy-5.0.1.tar.gz.

File metadata

Download URL: pywebcopy-5.0.1.tar.gz
Upload date: Jan 6, 2019
Size: 34.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/3.7.0

File hashes

Hashes for pywebcopy-5.0.1.tar.gz
Algorithm	Hash digest
SHA256	`388910de9d007257e90c46adcc49652c194f2f8fe18a712e64fee5e1213bffae`
MD5	`6cfc4271b1ddfffc8fd05c9bee2fdde2`
BLAKE2b-256	`a3aeafea657ab7ac8d20f10bf55282e0c698ea2b74e82c2da07b2c8dc0750516`


File details

Details for the file pywebcopy-5.0.1-py2.py3-none-any.whl.

File metadata

Download URL: pywebcopy-5.0.1-py2.py3-none-any.whl
Upload date: Jan 6, 2019
Size: 33.4 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/3.7.0

File hashes

Hashes for pywebcopy-5.0.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`a57f978f4ee710f61ce801d841db9f6687a4ae665181e5925d4341688f663990`
MD5	`0ccc40a1ad9da12f12ca678f15589085`
BLAKE2b-256	`5c78c73e87960d6210a07c265c359bdeab9caf2b49bdb4f635febd38996975e6`

