Data retrieval from remote archives

Project description

Overview

Advarchs is a simple tool for retrieving data from web archives. It is especially useful when you work with remote data stored in compressed spreadsheets or similar formats.

Getting Started

Say you need to perform some data analytics on an Excel spreadsheet that is refreshed every month and stored in RAR format. You can target that file and load it into pandas with the following procedure:

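import os
import tempfile

import pandas as pd
from advarchs import webfilename, extract_web_archive

# Store the downloaded archive in the system temporary directory
TEMP_DIR = tempfile.gettempdir()

url = "http://www.site.com/archive.rar"
# Resolve the archive's file name for the URL
arch_file_name = webfilename(url)
arch_path = os.path.join(TEMP_DIR, arch_file_name)
# Download the archive to arch_path and extract the files matching the filter
xlsx_files = extract_web_archive(url, arch_path, ffilter=['xlsx'])
for xlsx_f in xlsx_files:
    # Open each extracted workbook with pandas
    xlsx = pd.ExcelFile(xlsx_f)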

...

Requirements

  • Python 3.5+

  • p7zip

Special note

On CentOS and Ubuntu <= 16.04, the following package is also required:

  • unrar
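Before running the extraction, it can help to confirm that the external tools listed above are actually on the PATH. The following is a minimal sketch, not part of advarchs: the helper names `find_tool` and `check_requirements` are hypothetical, and the executable names (`7z`, `7za`, `7zr` for p7zip, `unrar` for the unrar package) are assumptions about how those packages are typically installed.

```python
import shutil

# Assumed executable names: p7zip usually installs one or more of
# `7z`, `7za`, `7zr`; the unrar package installs `unrar`.
SEVEN_ZIP_CANDIDATES = ("7z", "7za", "7zr")


def find_tool(candidates):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


def check_requirements(need_unrar=False):
    """Return the names of required tools that were not found (hypothetical helper)."""
    missing = []
    if find_tool(SEVEN_ZIP_CANDIDATES) is None:
        missing.append("p7zip")
    if need_unrar and find_tool(("unrar",)) is None:
        missing.append("unrar")
    return missing


if __name__ == "__main__":
    # Pass need_unrar=True on CentOS or Ubuntu <= 16.04
    missing = check_requirements(need_unrar=True)
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only means the executables were found; it does not verify their versions.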

Installation

pip install advarchs

Contributing

See CONTRIBUTING

Code of Conduct

This project adheres to the Contributor Covenant 1.2. By participating, you are advised to adhere to this Code of Conduct in all your interactions with this project.

License

Apache-2.0

Download files

Source Distribution

advarchs-0.1.7.tar.gz (10.4 kB)

Uploaded Source

Built Distribution

advarchs-0.1.7-py3-none-any.whl (10.8 kB)

Uploaded Python 3
