A library for efficiently loading data into Python
pytubes
=======
.. image:: https://travis-ci.org/stestagg/pytubes.svg?branch=master
    :target: https://travis-ci.org/stestagg/pytubes

.. image:: https://readthedocs.org/projects/pytubes/badge/
    :target: https://pytubes.readthedocs.io/en/latest/

Source: https://github.com/stestagg/pytubes
Pytubes is a library that optimizes loading datasets into memory.
At its core is a set of specialized C++ classes that can be chained together to load and manipulate data using a standard iterator pattern. Around this core is a Cython extension module that makes defining and configuring a tube simple and straightforward.
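
The chained-iterator idea can be illustrated in plain Python. The sketch below is not pytubes code and does not use its API; the helper names (``split_lines``, ``parse_json``, ``get_field``) are hypothetical and exist only to show how each stage pulls items from the stage before it.

```python
import json
from typing import Iterable, Iterator


def split_lines(chunks: Iterable[bytes], sep: bytes = b"\n") -> Iterator[bytes]:
    """Re-chunk an iterator of byte blocks into one item per line."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        # Everything before the last separator is complete; keep the tail.
        *lines, pending = pending.split(sep)
        yield from lines
    if pending:
        yield pending


def parse_json(lines: Iterable[bytes]) -> Iterator[object]:
    """Parse each non-empty line as a JSON document."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def get_field(records: Iterable[dict], name: str, default=None) -> Iterator[object]:
    """Extract one field from each record, falling back to a default."""
    for record in records:
        yield record.get(name, default)


# Two chunks that split a record in the middle, as a chunked file read might.
chunks = [b'{"country_code": "GB"}\n{"coun', b'try_code": "AD"}\n{"other": 1}\n']

# Each stage wraps the previous one; nothing runs until the result is consumed.
pipeline = get_field(parse_json(split_lines(chunks)), "country_code")
countries = list(pipeline)
print(countries)  # ['GB', 'AD', None]
```

Because every stage is a generator, only one item is in flight at a time, which is the same lazy, pull-based shape that a tube exposes to Python.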
Simple Example
--------------
>>> from tubes import Each
>>> import glob
>>> tube = (Each(glob.glob("*.json"))         # Iterate over some filenames
...         .read_files()                     # Read each file, chunk by chunk
...         .split()                          # Split the file, line by line
...         .json()                           # Parse JSON
...         .get('country_code', 'null'))     # Extract the field named 'country_code'
>>> set(tube)                                 # Collect results in a set
{'A1', 'AD', 'AE', 'AF', 'AG', 'AL', 'AM', 'AO', 'AP', ...}
More Complex Example
--------------------
>>> from tubes import Each
>>> import glob
>>> x = (Each(glob.glob('*.jsonz'))
...     .map_files()
...     .gunzip()
...     .split(b'\n')
...     .json()
...     .enumerate()
...     .skip_unless(lambda x: x.slot(1).get('country_code', '""').to(str).equals('GB'))
...     .multi(lambda x: (
...         x.slot(0),
...         x.slot(1).get('timestamp', 'null'),
...         x.slot(1).get('country_code', 'null'),
...         x.slot(1).get('url', 'null'),
...         x.slot(1).get('file', '{}').get('filename', 'null'),
...         x.slot(1).get('file', '{}').get('project'),
...         x.slot(1).get('details', '{}').get('installer', '{}').get('name', 'null'),
...         x.slot(1).get('details', '{}').get('python', 'null'),
...         x.slot(1).get('details', '{}').get('system', 'null'),
...         x.slot(1).get('details', '{}').get('system', '{}').get('name', 'null'),
...         x.slot(1).get('details', '{}').get('cpu', 'null'),
...         x.slot(1).get('details', '{}').get('distro', '{}').get('libc', '{}').get('lib', 'null'),
...         x.slot(1).get('details', '{}').get('distro', '{}').get('libc', '{}').get('version', 'null'),
...     ))
... )
>>> print(list(x)[-3])
(15,612,767, '2017-12-14 09:33:31 UTC', 'GB', '/packages/29/9b/25ef61e948321296f029f53c9f67cc2b54e224db509eb67ce17e0df6044a/certifi-2017.11.5-py2.py3-none-any.whl', 'certifi-2017.11.5-py2.py3-none-any.whl', 'certifi', 'pip', '2.7.5', {'name': 'Linux', 'release': '2.6.32-696.10.3.el6.x86_64'}, 'Linux', 'x86_64', 'glibc', '2.17')
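
For readers who want to see what the tube above computes, here is a rough pure-Python sketch of the same steps (decompress, split, parse, enumerate, filter, project) using only the standard library. It is an approximation under stated assumptions, not pytubes' implementation: the helper names ``nested_get`` and ``gb_rows`` are hypothetical, indices are assumed to start at 0, missing fields are assumed to map to ``None``, and only a few of the output columns are reproduced. The sample records are made up.

```python
import gzip
import json


def nested_get(record, *keys, default=None):
    """Walk nested dicts, returning a default if any key is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def gb_rows(compressed_blobs):
    """Yield (index, timestamp, project, installer) for 'GB' records."""
    lines = (
        line
        for blob in compressed_blobs
        for line in gzip.decompress(blob).split(b"\n")
        if line.strip()
    )
    for index, record in enumerate(map(json.loads, lines)):
        if record.get("country_code") != "GB":
            continue  # Plays the role of skip_unless(...) in the tube.
        yield (
            index,
            record.get("timestamp"),
            nested_get(record, "file", "project"),
            nested_get(record, "details", "installer", "name"),
        )


# Made-up sample data, gzip-compressed as newline-delimited JSON.
records = [
    {"country_code": "US", "timestamp": "t0"},
    {"country_code": "GB", "timestamp": "t1",
     "file": {"project": "demo"}, "details": {"installer": {"name": "pip"}}},
    {"country_code": "GB", "timestamp": "t2"},
]
blob = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))

rows = list(gb_rows([blob]))
print(rows)  # [(1, 't1', 'demo', 'pip'), (2, 't2', None, None)]
```

Note that the index is assigned before filtering, so surviving rows keep their position in the original stream, matching the order of ``.enumerate()`` and ``.skip_unless(...)`` in the tube.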
Contents
--------
.. role:: noshow
:noshow:`If you're viewing this in github, please visit the docs at: https://pytubes.readthedocs.io/en/latest/`
.. toctree::
    :maxdepth: 3
    :name: mastertoc

    intro_usage
    tubes
    performance
    detail