Mirrors online webpages and complete websites.
PyWebCopy © 2
Created By : Raja Tomar
License : MIT
Mirrors Complete webpages with python.
Website mirroring and archiving tool written in Python
Archive any online website and its assets (CSS, JS and
images) for offline reading, storage or whatever reasons.
It's easy with pywebcopy.
Why is it great? Because it:
- has single-function basic usage
- offers lots of configuration for many custom needs
- provides generator functions to ease extraction of any part of a website
Email me at
firstname.lastname@example.org for any query :)
pywebcopy is available on PyPI and is easily installable using pip:

    pip install pywebcopy
1.2 Basic Usages
1.2.1 Direct Function Methods
To mirror any single page, just type in the Python console:

    from pywebcopy.core import save_webpage

    save_webpage(
        url='http://example-site.com/index.html',
        download_loc='path/to/downloads'
    )
To mirror a full website (this could overload the target server, so be careful):

    from pywebcopy.core import save_webpage

    save_webpage(
        url='http://example-site.com/index.html',
        download_loc='path/to/downloads',
        copy_all=True
    )
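Since a full mirror can put real load on the target server, one optional precaution is to look at the site's robots.txt before starting. The sketch below is not part of pywebcopy; it uses only the standard library, and the helper name `check_robots` and the sample robots.txt content are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt, url, user_agent="*"):
    """Return (allowed, crawl_delay) for `url`, given the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


# Hypothetical robots.txt content; in practice you would first download
# the real file from the site you intend to mirror.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

allowed, delay = check_robots(SAMPLE_ROBOTS, "http://example-site.com/index.html")
blocked, _ = check_robots(SAMPLE_ROBOTS, "http://example-site.com/private/page.html")
print(allowed, delay, blocked)
```

If the start page is disallowed, or a crawl delay is requested, it is probably kinder to mirror single pages rather than passing copy_all=True.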
1.2.2 Object Creation Method
    from pywebcopy.structures import WebPage

    url = 'http://example-site.com/index.html'
    download_loc = 'path/to/downloads/folder'

    wp = WebPage(url, download_loc)

    # if you want assets only
    wp.save_assets_only()

    # if you want html only
    wp.save_html_only()

    # if you want complete webpage
    wp.save_complete()

    # bonus: you can also use any beautiful_soup methods on it
    links = wp.find_all('a', href=True)
You will now have a folder at
download_loc with the webpage and all its linked files ready to be used.
Just browse it as you would in any browser!
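To check what actually landed in the download folder, a small standard-library helper can list the saved files. This is only a sketch: `list_mirrored_files` is a made-up name, and the demo builds a temporary folder with placeholder files instead of a real mirror.

```python
import os
import tempfile
from pathlib import Path


def list_mirrored_files(download_loc):
    """Return sorted, POSIX-style relative paths of all files under download_loc."""
    root = Path(download_loc)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


# Demo on a throwaway folder that imitates a tiny mirror.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "css"))
    Path(demo_dir, "index.html").write_text("<html></html>")
    Path(demo_dir, "css", "style.css").write_text("body {}")
    for name in list_mirrored_files(demo_dir):
        print(name)
```

Pointing the helper at your own download_loc gives a quick overview of the pages and assets that were saved.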
pywebcopy is highly configurable.
1.3.1 Direct Call Method
To change any configuration, just pass it as a keyword argument, as in this call:

    from pywebcopy.core import save_webpage

    save_webpage(
        url='http://some-site.com/',        # required
        download_loc='path/to/downloads/',  # required

        # config keys are case-insensitive
        any_config_key='new_value',
        another_config_key='another_new_value',

        # ... add as many as you want :)
    )
This function is changed from
You can also set every configuration value manually, as in this session:

    import pywebcopy

    url = 'http://example-site.com/index.html'
    download_loc = 'path/to/downloads/'

    pywebcopy.config.setup_config(url, download_loc)

    # done!

    >>> pywebcopy.config.config['url']
    'http://example-site.com/index.html'

    >>> pywebcopy.config.config['mirrors_dir']
    'path/to/downloads'

    >>> pywebcopy.config.config['project_name']
    'example-site.com'

    ## You can also change any of these by just adding param to
    ## `setup_config` call
    >>> pywebcopy.config.setup_config(url, download_loc, project_name='Your-Project', ...)

    ## You can also change any config even after
    ## the `setup_config` call
    pywebcopy.config.config['url'] = 'http://url-changed.com'
    # rest of config remains unchanged
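To make the defaults shown above less mysterious, here is a rough sketch of how such values could be derived from just a url and a download location. It is not pywebcopy's actual implementation: `sketch_setup_config` is a hypothetical stand-in, and the rules it uses (host name as project name, trailing slash stripped, lower-cased override keys) are guesses based only on the outputs shown in this section.

```python
from urllib.parse import urlparse


def sketch_setup_config(url, download_loc, **overrides):
    """Hypothetical stand-in for setup_config: build defaults, then apply overrides."""
    config = {
        "url": url,
        # Assumption: the trailing slash is dropped, as in the session above.
        "mirrors_dir": download_loc.rstrip("/"),
        # Assumption: the project name defaults to the host part of the url.
        "project_name": urlparse(url).netloc,
    }
    # Lower-case override keys so PROJECT_NAME and project_name are equivalent.
    config.update({key.lower(): value for key, value in overrides.items()})
    return config


defaults = sketch_setup_config("http://example-site.com/index.html", "path/to/downloads/")
custom = sketch_setup_config(
    "http://example-site.com/index.html",
    "path/to/downloads/",
    PROJECT_NAME="Your-Project",
)
print(defaults["project_name"], defaults["mirrors_dir"], custom["project_name"])
```

Lower-casing keys on the way in is one simple way to get case-insensitive config keys; the library may well do it differently.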
1.3.3 List of available config keys
Below is the list of config keys with their default values:
Told you there were plenty of config vars available!
For any queries related to this project you can email me at
You can help in many ways:
- reporting bugs
- sending me patches to fix or improve the code
- helping generate the complete documentation of this project
1.5 Undocumented Features
I built many utils and classes in this project to ease the tasks I was trying to do.
But these are also suitable for general-purpose use.
If you want to help write suitable documentation for the
undocumented ones, you can always email me.
- core.setup_config function is changed to
- utils.trace decorator, which will print function_name, args, kwargs and return value when the debug config key is True.
- new html-parsers ('html5lib', 'lxml') are supported for better webpages.
- html-parser is now defaulted to 'lxml'. You can use any through new
- fixed issue where changing the user-agent key cracked webpages. You can now use any browser's user-agent id and it will get the exact same page downloaded.
- fixed issue in generators.extract_css_urls which was caused by bytes difference in python3.
- fixed issues while importing modules. (Thanks "Илья Игоревич").
- error handling added to required functions
- init function is replaced with
- three new config automation functions are added -
  - core.setup_config (creates every ideal config just from url and download location)
  - config.reset_config (resets the configuration to default state)
  - config.update_config (manual-mode version of
- generators.generate_relative_paths to a single function
- rewrite of majority of functions
- new module
- url is checked and resolved of any redirection before starting any work functions.
- issues in clean_up were fixed which cleaned the dir before the log was completely written.
- init call now takes url arg by default and could raise an error when not supplied
- professional looking log entries
- rewritten archiving system now uses exception handling to prevent errors and eventual archive corruption
- more redundant code
- modules are now separated based on type e.g. Core, Generators, Utils etc.
- new helper functions and class
- Compatible with Python 2.6, 2.7, 3.6, 3.7
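The changelog above mentions a utils.trace decorator that prints the function name, args, kwargs and return value when the debug config key is True. A minimal sketch of that idea follows. It is not the library's code: the `config` dict here is a hypothetical stand-in for the real config mapping, and the output format is invented.

```python
import functools

# Hypothetical stand-in for the library's config mapping.
config = {"debug": True}


def trace(func):
    """Print name, args, kwargs and return value of each call while debug is on."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if config.get("debug"):
            print("%s args=%r kwargs=%r -> %r" % (func.__name__, args, kwargs, result))
        return result
    return wrapper


@trace
def add(a, b=0):
    return a + b


add(1, b=2)  # prints: add args=(1,) kwargs={'b': 2} -> 3
```

Checking the flag at call time rather than at decoration time means tracing can be switched on and off without re-importing the decorated functions.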
|Filename, size|File type|Python version|
|pywebcopy-2.0.3.tar.gz (21.6 kB)|Source|None|