os-spage
Read and write Spage.
Spage is a non-compact data structure for describing a fetched record. It generally contains four sub-blocks: url, inner_header, http_header, and data.
Spage:
- url: the URL.
- inner_header: key-values that can be used to record fetch/process info, such as fetch-time, data-digest, record-type, etc.
- http_header: key-values holding the HTTP headers of the server's response.
- data: the fetched data, which can be plain or compressed HTML.
Spage is implemented as a dict. A predefined schema can be used for validation.
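To make the four sub-blocks concrete, here is a minimal sketch of a Spage-like dict and a simple structural check. The key names mirror the sub-block names listed above, and `check_record` and `REQUIRED_BLOCKS` are hypothetical names used only for illustration; the library's actual schema and validation may differ.

```python
# Hypothetical illustration: the key names mirror the four sub-blocks
# described above; the library's real schema may differ.
REQUIRED_BLOCKS = ("url", "inner_header", "http_header", "data")


def check_record(record):
    """Return a list of problems found in a Spage-like dict (empty if none)."""
    problems = []
    for key in REQUIRED_BLOCKS:
        if key not in record:
            problems.append("missing sub-block: " + key)
    for key in ("inner_header", "http_header"):
        if key in record and not isinstance(record[key], dict):
            problems.append(key + " must be a dict of key-values")
    return problems


record = {
    "url": "http://www.google.com/",
    "inner_header": {"batchID": "test"},
    "http_header": {"Content-Type": "text/html"},
    "data": b"Hello world!",
}
print(check_record(record))  # an empty list means no problems were found
```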
It is common to write Spage to a size-rotated file; os-rotatefile is the default back end.
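For readers unfamiliar with size rotation, the idea can be sketched in plain Python as follows. This is not os-rotatefile's implementation: the class `SizeRotatingWriter`, the numeric `roll_size` in bytes, and the `file.0`, `file.1`, ... naming are all assumptions made for the sake of the example.

```python
import os
import tempfile


class SizeRotatingWriter:
    """Hypothetical sketch of size-based rotation; not os-rotatefile's code."""

    def __init__(self, base_path, roll_size):
        self.base_path = base_path
        self.roll_size = roll_size  # maximum bytes per file before rolling
        self.index = 0
        self.written = 0
        self._fh = open(self._path(), "wb")

    def _path(self):
        # Files are named base_path.0, base_path.1, ... (naming is an assumption).
        return "{}.{}".format(self.base_path, self.index)

    def write(self, chunk):
        # Roll over to a new file when this chunk would exceed the size limit.
        if self.written and self.written + len(chunk) > self.roll_size:
            self._fh.close()
            self.index += 1
            self.written = 0
            self._fh = open(self._path(), "wb")
        self._fh.write(chunk)
        self.written += len(chunk)

    def close(self):
        self._fh.close()


tmp_dir = tempfile.mkdtemp()
writer = SizeRotatingWriter(os.path.join(tmp_dir, "file"), roll_size=10)
for _ in range(3):
    writer.write(b"123456")  # 6 bytes each, so each later write starts a new file
writer.close()
rotated = sorted(os.listdir(tmp_dir))
print(rotated)
```

Each record lands whole in a single file here; whether the real back end splits records across files is not something this sketch tries to reproduce.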
Notice:
- os-spage should not be used for strict serialization/deserialization: type information is lost on write, and all values are read back as strings (unicode in Python 2).
- Usually, the data is stored in compressed format. You can use zlib.decompress to decompress it.
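Since data may be plain or compressed, a reader may want a helper that tolerates both. The sketch below uses only the standard zlib module; `maybe_decompress` is a hypothetical helper name, and the payload is a stand-in rather than a real record read through os-spage.

```python
import zlib


def maybe_decompress(data):
    """Return decompressed bytes, or the input unchanged if it is not zlib data."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        # Not a valid zlib stream, so assume the data was stored uncompressed.
        return data


html = b"<html><body>Hello world!</body></html>"
stored = zlib.compress(html)  # stand-in for a compressed data sub-block
print(maybe_decompress(stored) == html)
print(maybe_decompress(html) == html)
```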
Offpage:
From v0.4, this library supports reading from offpage. Offpage is another data storage format that includes a URL, headers, and series data. Use the read/open_file methods with page_type="offpage" to read from offpage.
From v0.5, spage can be transformed into offpage. Use the read/open_file methods with page_type="s2o" to read from spage and transform each record into offpage format. (Not fully tested yet.)
Example:
from os_spage import read
# Open in binary mode; the with block closes the file when iteration ends.
with open('your_spage', 'rb') as f:
    for offpage in read(f, page_type='s2o'):
        print(offpage)
Install
pip install os-spage
Usage
- Write to size-rotate-file
from os_spage import open_file
url = 'http://www.google.com/'
inner_header = {'User-Agent': 'Mozilla/5.0', 'batchID': 'test'}
http_header = {'Content-Type': 'text/html'}
data = b"Hello world!"
f = open_file('file', 'w', roll_size='1G', compress=True)
f.write(url, inner_header=inner_header, http_header=http_header, data=data, flush=True)
f.close()
- Read from size-rotate-file
from os_spage import open_file
f = open_file('file', 'r')
for record in f.read():
    print(record)
f.close()
- R/W with other file-like object
from io import BytesIO
from os_spage import read, write
s = BytesIO()
write(s, "http://www.google.com/")
s.seek(0)
for record in read(s):
    print(record)
Unit Tests
$ tox
License
MIT licensed.