A library for efficiently loading data into Python
pytubes
=======
A library for loading data into Python

.. toctree::
   :caption: Contents
   :maxdepth: 2
   :name: mastertoc

   tubes
   detail

Simple Example
--------------

>>> from tubes import Each
>>> import glob
>>> tube = (Each(glob.glob("*.json"))    # Iterate over some filenames
...     .read_files()                    # Read each file, chunk by chunk
...     .split()                         # Split the file, line-by-line
...     .json()                          # parse json
...     .get('country_code', 'null'))    # extract field named 'country_code'
>>> set(tube)                            # collect results in a set
{'A1', 'AD', 'AE', 'AF', 'AG', 'AL', 'AM', 'AO', 'AP', ...}

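
For comparison, the same steps can be sketched with the standard library alone. This is not how pytubes works internally; it only shows what the pipeline above computes. The helper name ``country_codes`` and the sample file are hypothetical, and mapping the ``'null'`` default to ``None`` is an assumption about how the default is interpreted.

```python
import glob
import json
import os
import tempfile


def country_codes(pattern):
    """Yield the 'country_code' field of every JSON line in matching files."""
    for filename in glob.glob(pattern):
        with open(filename, "rb") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                # None stands in for the JSON 'null' default (assumption).
                yield record.get("country_code")


# Hypothetical sample data so the sketch runs on its own.
sample_dir = tempfile.mkdtemp()
with open(os.path.join(sample_dir, "sample.json"), "w") as handle:
    handle.write('{"country_code": "GB"}\n{"country_code": "AD"}\n{"other": 1}\n')

codes = set(country_codes(os.path.join(sample_dir, "*.json")))
print(codes)
```

Every line here costs several Python-level function calls and object allocations, which is the overhead the tube version is designed to avoid.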
More Complex Example
--------------------

>>> from tubes import Each
>>> import glob
>>> x = (Each(glob.glob('*.jsonz'))
...     .map_files()
...     .gunzip()
...     .split(b'\n')
...     .json()
...     .enumerate()
...     .skip_unless(lambda x: x.slot(1).get('country_code', '""').to(str).equals('GB'))
...     .multi(lambda x: (
...         x.slot(0),
...         x.slot(1).get('timestamp', 'null'),
...         x.slot(1).get('country_code', 'null'),
...         x.slot(1).get('url', 'null'),
...         x.slot(1).get('file', '{}').get('filename', 'null'),
...         x.slot(1).get('file', '{}').get('project'),
...         x.slot(1).get('details', '{}').get('installer', '{}').get('name', 'null'),
...         x.slot(1).get('details', '{}').get('python', 'null'),
...         x.slot(1).get('details', '{}').get('system', 'null'),
...         x.slot(1).get('details', '{}').get('system', '{}').get('name', 'null'),
...         x.slot(1).get('details', '{}').get('cpu', 'null'),
...         x.slot(1).get('details', '{}').get('distro', '{}').get('libc', '{}').get('lib', 'null'),
...         x.slot(1).get('details', '{}').get('distro', '{}').get('libc', '{}').get('version', 'null'),
...     ))
... )
>>> print(list(x)[-3])
(15612767, '2017-12-14 09:33:31 UTC', 'GB', '/packages/29/9b/25ef61e948321296f029f53c9f67cc2b54e224db509eb67ce17e0df6044a/certifi-2017.11.5-py2.py3-none-any.whl', 'certifi-2017.11.5-py2.py3-none-any.whl', 'certifi', 'pip', '2.7.5', {'name': 'Linux', 'release': '2.6.32-696.10.3.el6.x86_64'}, 'Linux', 'x86_64', 'glibc', '2.17')

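
The chained ``.get(..., '{}').get(..., 'null')`` calls above walk into nested JSON objects and fall back to a default when a key is missing. A rough pure-Python analogue is sketched below. The helper ``dig`` is hypothetical and is not part of pytubes; it assumes that a ``'{}'`` default means "treat a missing object as empty" and that ``'null'`` corresponds to ``None``.

```python
import json


def dig(record, *keys, default=None):
    """Follow keys through nested dicts; return default if any key is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical record, shaped like the fields used in the example above.
line = (
    b'{"country_code": "GB", '
    b'"details": {"distro": {"libc": {"lib": "glibc", "version": "2.17"}}}}'
)
record = json.loads(line)

row = (
    dig(record, "country_code"),
    dig(record, "details", "distro", "libc", "lib"),
    dig(record, "details", "installer", "name"),  # missing, so the default is used
)
print(row)
```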
What is it?
-----------

Pytubes is a library that optimizes loading datasets into memory.
At its core is a set of specialized C++ classes that can be chained together
to load and manipulate data using a standard iterator pattern. Around this
there is a Cython extension module that makes defining and configuring a tube
simple and straightforward.

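
The chaining idea can be illustrated in pure Python. The sketch below is a hypothetical stand-in, not the pytubes implementation: the ``Pipeline`` class, ``each`` helper, and method names are invented for illustration, and it has none of the C++ performance characteristics. Each method returns a new object wrapping the previous one, and no work happens until the result is iterated.

```python
import json


class Pipeline:
    """Hypothetical pure-Python stand-in for a chain of tubes."""

    def __init__(self, make_iter):
        # Zero-argument callable that returns a fresh iterator each time.
        self._make_iter = make_iter

    def __iter__(self):
        return self._make_iter()

    def _then(self, step):
        # Wrap this pipeline in a further lazy processing step.
        parent = self
        return Pipeline(lambda: step(iter(parent)))

    def split(self, sep="\n"):
        def step(chunks):
            # Simplification: each chunk is split independently, so a
            # record spanning two chunks would not be rejoined.
            for chunk in chunks:
                for piece in chunk.split(sep):
                    if piece:
                        yield piece
        return self._then(step)

    def parse_json(self):
        return self._then(lambda items: (json.loads(item) for item in items))

    def get(self, key, default=None):
        return self._then(lambda items: (item.get(key, default) for item in items))


def each(values):
    """Create an input pipeline from any iterable."""
    values = list(values)
    return Pipeline(lambda: iter(values))


chunks = ['{"country_code": "GB"}\n{"country_code": "FR"}', '{"x": 1}\n']
pipeline = each(chunks).split().parse_json().get("country_code")
result = list(pipeline)
print(result)
```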

When loading data using pure Python, much of the cost typically comes from
function call overhead and from allocating and copying object data.

Pytubes tackles these bottlenecks using a number of strategies:

- Iterator hot-loops are pure C++ function calls.
- Zero-copy views onto array data.
- Strict epoch-based lifetime rules avoid reference counting or GC during iteration.
- Where possible, zero allocations during iteration.
- Avoiding creating Python objects where possible.

These optimizations lead to significant performance improvements over pure Python,
despite offering complex loading functionality.

Usage
-----

Usage is simple:

#. Import ``tubes``.
#. Create an input tube (currently either :class:`tubes.Each` or :class:`tubes.Count`) to get some data into the tube.
#. Chain methods on the input tube to build up each step of the processing (e.g. ``read_files().split().json()``...).
#. Iterate over the tube to generate the data, by either:

   - calling ``list(tube)``,
   - looping over it in a for-loop: ``for item in tube:``, or
   - calling ``x = iter(tube)`` and then ``next(x)`` repeatedly.

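
The three ways of consuming a tube listed above are the ordinary Python iteration protocol. The sketch below uses a plain generator as a hypothetical stand-in for a tube (``make_tube`` is not a pytubes function), so that it runs without the library installed.

```python
def make_tube():
    """Hypothetical stand-in for a tube: a plain iterable of squares."""
    return (n * n for n in range(4))


# 1. Collect everything at once.
all_items = list(make_tube())

# 2. Loop over the items one by one.
looped = []
for item in make_tube():
    looped.append(item)

# 3. Drive the iterator manually.
it = iter(make_tube())
first = next(it)
second = next(it)

print(all_items, looped, first, second)
```

Manual iteration with ``next`` is mainly useful when only the first few items are needed, since nothing beyond them is computed.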
Installation
------------

**From PyPI**::

    $ pip install pytubes

**From source**::

    $ pip install -r build_requirements.txt
    $ cd pyx
    $ python setup.py install

API
---
All tube methods are documented here: :ref:`api`