A library for efficiently loading data into Python
pytubes
=======
.. image:: https://travis-ci.org/stestagg/pytubes.svg?branch=master
    :target: https://travis-ci.org/stestagg/pytubes

.. image:: https://readthedocs.org/projects/pytubes/badge/
    :target: https://pytubes.readthedocs.io/en/latest/

Source: https://github.com/stestagg/pytubes

Pytubes is a library that optimizes loading datasets into memory.
At its core is a set of specialized C++ classes that can be chained together to load and manipulate data using a standard iterator pattern. Around these sits a Cython extension module that makes defining and configuring a tube simple and straightforward.
Simple Example
--------------
>>> from tubes import Each
>>> import glob
>>> tube = (Each(glob.glob("*.json"))      # Iterate over some filenames
...     .read_files()                      # Read each file, chunk by chunk
...     .split()                           # Split the file, line-by-line
...     .json()                            # Parse JSON
...     .get('country_code', 'null'))      # Extract the field named 'country_code'
>>> set(tube)                              # Collect results in a set
{'A1', 'AD', 'AE', 'AF', 'AG', 'AL', 'AM', 'AO', 'AP', ...}
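For comparison, here is a rough plain-Python sketch of the same chained-iterator idea, using only the standard library. It is not how pytubes is implemented (pytubes runs its stages in C++); the function names ``read_lines``, ``parse_json``, ``get_field`` and ``country_codes`` are made up for illustration, and skipping blank lines is an assumption of this sketch rather than documented pytubes behaviour.

```python
import glob
import json


def read_lines(paths):
    """Yield each line of each file, one at a time."""
    for path in paths:
        with open(path, "rb") as handle:
            for line in handle:
                yield line


def parse_json(lines):
    """Parse each non-blank line as a JSON document."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def get_field(records, name, default=None):
    """Yield one field from each record, or a default if it is absent."""
    for record in records:
        yield record.get(name, default)


def country_codes(pattern="*.json"):
    """Chain the stages together, mirroring the tube above."""
    return get_field(parse_json(read_lines(glob.glob(pattern))), "country_code")
```

Calling ``set(country_codes())`` collects the distinct values, much as ``set(tube)`` does above. Each stage is a lazy generator, so only one line is held in memory at a time.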
More Complex Example
--------------------
>>> from tubes import Each
>>> import glob
>>> x = (Each(glob.glob('*.jsonz'))
...     .map_files()
...     .gunzip()
...     .split(b'\n')
...     .json()
...     .enumerate()
...     .skip_unless(lambda x: x.slot(1).get('country_code', '""').to(str).equals('GB'))
...     .multi(lambda x: (
...         x.slot(0),
...         x.slot(1).get('timestamp', 'null'),
...         x.slot(1).get('country_code', 'null'),
...         x.slot(1).get('url', 'null'),
...         x.slot(1).get('file', '{}').get('filename', 'null'),
...         x.slot(1).get('file', '{}').get('project'),
...         x.slot(1).get('details', '{}').get('installer', '{}').get('name', 'null'),
...         x.slot(1).get('details', '{}').get('python', 'null'),
...         x.slot(1).get('details', '{}').get('system', 'null'),
...         x.slot(1).get('details', '{}').get('system', '{}').get('name', 'null'),
...         x.slot(1).get('details', '{}').get('cpu', 'null'),
...         x.slot(1).get('details', '{}').get('distro', '{}').get('libc', '{}').get('lib', 'null'),
...         x.slot(1).get('details', '{}').get('distro', '{}').get('libc', '{}').get('version', 'null'),
...     ))
... )
>>> print(list(x)[-3])
(15612767, '2017-12-14 09:33:31 UTC', 'GB', '/packages/29/9b/25ef61e948321296f029f53c9f67cc2b54e224db509eb67ce17e0df6044a/certifi-2017.11.5-py2.py3-none-any.whl', 'certifi-2017.11.5-py2.py3-none-any.whl', 'certifi', 'pip', '2.7.5', {'name': 'Linux', 'release': '2.6.32-696.10.3.el6.x86_64'}, 'Linux', 'x86_64', 'glibc', '2.17')
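The chained ``.get(key, default)`` calls above appear to take their default as a JSON literal (``'null'``, ``'{}'``, ``'""'``). Assuming that reading is right, the lookup logic can be sketched in plain Python as below. ``get_path`` is a hypothetical helper written for this illustration and is not part of pytubes.

```python
import json


def get_path(record, keys, default_json="null"):
    """Walk nested dicts; if any key is missing, return the parsed JSON default."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return json.loads(default_json)
        current = current[key]
    return current


record = {"details": {"installer": {"name": "pip"}}}
installer = get_path(record, ["details", "installer", "name"])
libc = get_path(record, ["details", "distro", "libc", "lib"])
```

Here ``installer`` is ``'pip'``, while ``libc`` falls back to ``None`` because the ``distro`` key is absent, which is the role the ``'{}'`` and ``'null'`` defaults seem to play in the tube.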
Contents
--------
.. role:: noshow

:noshow:`If you're viewing this on GitHub, please visit the docs at: https://pytubes.readthedocs.io/en/latest/`

.. toctree::
    :maxdepth: 3
    :name: mastertoc

    intro_usage
    tubes
    performance
    detail