featureflow

featureflow is a Python library that lets users build feature extraction pipelines declaratively and control how and where those features are persisted.

Usage

The following example will compute word frequency in individual text documents, and then over the entire corpus of documents, but featureflow isn’t limited to text data. It’s designed to work well with sequential/streaming data (e.g. audio or video) that is often processed iteratively, in small chunks.

You can see all the code in this example in one place here.

We can define a graph of processing nodes like this:

import featureflow as ff


@ff.simple_in_memory_settings
class Document(ff.BaseModel):
    """
    Define the processing graph needed to extract document-level features,
    and whether and how those features should be persisted.
    """
    raw = ff.ByteStreamFeature(
        ff.ByteStream,
        chunksize=128,
        store=True)

    checksum = ff.JSONFeature(
        CheckSum,
        needs=raw,
        store=True)

    tokens = ff.Feature(
        Tokenizer,
        needs=raw,
        store=False)

    counts = ff.JSONFeature(
        WordCount,
        needs=tokens,
        store=True)

We can define the individual processing “nodes” referenced in the graph above like this:

import featureflow as ff
from collections import Counter
import re
import hashlib

class Tokenizer(ff.Node):
    """
    Tokenize a stream of text into individual, normalized (lowercase)
    words/tokens
    """
    def __init__(self, needs=None):
        super(Tokenizer, self).__init__(needs=needs)
        self._cache = ''
        # Raw string so that \W is not treated as an invalid string escape
        self._pattern = re.compile(r'(?P<word>[a-zA-Z]+)\W+')

    def _enqueue(self, data, pusher):
        self._cache += data.decode()

    def _dequeue(self):
        matches = list(self._pattern.finditer(self._cache))
        if not matches:
            raise ff.NotEnoughData()
        last_boundary = matches[-1].end()
        self._cache = self._cache[last_boundary:]
        return matches

    def _process(self, data):
        yield map(lambda x: x.groupdict()['word'].lower(), data)


class WordCount(ff.Aggregator, ff.Node):
    """
    Keep track of token frequency
    """
    def __init__(self, needs=None):
        super(WordCount, self).__init__(needs=needs)
        self._cache = Counter()

    def _enqueue(self, data, pusher):
        self._cache.update(data)


class CheckSum(ff.Aggregator, ff.Node):
    """
    Compute the checksum of a text stream
    """
    def __init__(self, needs=None):
        super(CheckSum, self).__init__(needs=needs)
        self._cache = hashlib.sha256()

    def _enqueue(self, data, pusher):
        self._cache.update(data)

    def _process(self, data):
        yield data.hexdigest()
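
The buffering in Tokenizer and the incremental hashing in CheckSum can be tried out without featureflow. The following is a minimal standard-library sketch, not featureflow's implementation: tokenize_chunks and checksum_chunks are hypothetical helper names, and the chunks are plain byte strings rather than data delivered by a processing graph. It illustrates why the pattern requires a trailing non-word character: a word cut off at a chunk boundary stays in the cache until the rest of it arrives.

```python
import hashlib
import re

# Same pattern as Tokenizer: a word only counts as complete once it is
# followed by at least one non-word character.
PATTERN = re.compile(r'(?P<word>[a-zA-Z]+)\W+')


def tokenize_chunks(chunks):
    """Yield lowercase words from an iterable of byte chunks."""
    cache = ''
    for chunk in chunks:
        cache += chunk.decode()
        matches = list(PATTERN.finditer(cache))
        if not matches:
            # Plays the role of NotEnoughData: wait for more input.
            continue
        # Keep whatever follows the last complete word for the next chunk.
        cache = cache[matches[-1].end():]
        for match in matches:
            yield match.group('word').lower()


def checksum_chunks(chunks):
    """Hash an iterable of byte chunks incrementally."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


# "brown" and "end" are deliberately split across chunk boundaries.
chunks = [b'The quick bro', b'wn fox. The e', b'nd.\n']
words = list(tokenize_chunks(chunks))
print(words)
```

Hashing the chunks one at a time gives the same digest as hashing the concatenated bytes, which is what makes a streaming checksum possible. Note that in this sketch a final word with no trailing non-word character would remain in the cache and never be emitted.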

We can also define a graph that will process an entire corpus of stored document features:

import featureflow as ff

@ff.simple_in_memory_settings
class Corpus(ff.BaseModel):
    """
    Define the processing graph needed to extract corpus-level features,
    and whether and how those features should be persisted.
    """
    docs = ff.Feature(
        lambda doc_cls: (doc.counts for doc in doc_cls),
        store=False)

    total_counts = ff.JSONFeature(
        WordCount,
        needs=docs,
        store=True)
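
Conceptually, the corpus-level total is the sum of the per-document counts. The sketch below shows that idea with collections.Counter alone, independent of featureflow; merge_counts is a hypothetical helper, and the counts are made-up illustration data rather than output from the graphs above.

```python
from collections import Counter


def merge_counts(per_document_counts):
    """Sum an iterable of word-count mappings into one Counter."""
    total = Counter()
    for counts in per_document_counts:
        # Counter.update adds to existing counts instead of replacing them.
        total.update(counts)
    return total


# Illustration data standing in for each document's stored counts.
doc_counts = [
    {'the': 3, 'fox': 1},
    {'the': 2, 'dog': 4},
]
total_counts = merge_counts(doc_counts)
print(total_counts.get('the', 0))
```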

Finally, we can execute these processing graphs and access the stored features like this:

from __future__ import print_function
import argparse

def process_urls(urls):
    for url in urls:
        Document.process(raw=url)


def summarize_document(doc):
    return 'doc {_id} with checksum {cs} contains "the" {n} times'.format(
            _id=doc._id,
            cs=doc.checksum,
            n=doc.counts.get('the', 0))


def process_corpus(document_cls):
    corpus_id = Corpus.process(docs=document_cls)
    return Corpus(corpus_id)


def summarize_corpus(corpus):
    return 'The entire text corpus contains "the" {n} times'.format(
        n=corpus.total_counts.get("the", 0))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--url',
        help='specify one or more urls of text files to ingest',
        required=True,
        action='append')
    args = parser.parse_args()

    process_urls(args.url)

    for doc in Document:
        print(summarize_document(doc))

    corpus = process_corpus(Document)
    print(summarize_corpus(corpus))

To see this in action, we can run:

python wordcount.py \
    --url http://textfiles.com/food/1st_aid.txt \
    --url http://textfiles.com/food/antibiot.txt \
    ...
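
Each --url flag adds one entry to args.url because the parser is configured with action='append'. A quick way to check this behaviour is to pass an argument list directly to parse_args instead of reading it from the command line. This is a small sketch, and the example.com URLs are placeholders, not files used by the original example.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--url', required=True, action='append')

# Passing a list here bypasses sys.argv, which is convenient for testing.
args = parser.parse_args([
    '--url', 'http://example.com/a.txt',
    '--url', 'http://example.com/b.txt',
])
urls = args.url
print(urls)
```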

Installation

Python headers are required. On Debian-based systems, you can install them by running:

apt-get install python-dev

NumPy is optional. If you’d like to use it, the Anaconda distribution is highly recommended.

Finally, install featureflow itself:

pip install featureflow

Release history

Version	Release date
3.0.3 (this version)	Mar 2, 2020
3.0.2	Mar 2, 2020
3.0.1	Mar 7, 2019
3.0.0	Mar 2, 2019
2.12.1	Nov 7, 2018
2.12.0	Nov 7, 2018
2.11.0	Nov 7, 2018
2.9.0	Jun 26, 2018
2.8.0	May 31, 2018
2.7.13	May 30, 2018
2.7.12	Mar 1, 2018
2.7.11	Feb 28, 2018
2.7.10	Feb 22, 2018
2.7.9	Feb 21, 2018
2.6.9	Jan 5, 2018
2.5.9	Jan 3, 2018
2.4.9	Oct 22, 2017
2.4.8	Oct 21, 2017
2.4.7	Oct 19, 2017
2.4.6	Oct 18, 2017
2.4.5	Oct 17, 2017
2.4.4	Oct 17, 2017
2.3.4	Oct 13, 2017
2.2.4	Oct 6, 2017
2.2.3	Sep 30, 2017
2.2.1	Sep 30, 2017
2.1.2	Sep 28, 2017
2.1.1	Sep 23, 2017
2.0.1	Sep 22, 2017
2.0.0	Sep 22, 2017
1.21.14	Sep 8, 2017
1.20.14	Sep 4, 2017
1.19.14	Sep 4, 2017
1.17.14	Jun 24, 2017
1.16.14	Mar 5, 2017
1.16.13	Mar 5, 2017
1.16.12	Aug 28, 2016
1.16.11	Jun 16, 2016
1.16.10	Jun 16, 2016
1.15.10	Jun 14, 2016
1.14.10	Jun 9, 2016
1.13.10	Jun 7, 2016
1.12.10	Jun 5, 2016
1.11.10	May 30, 2016
1.10.10	May 21, 2016
1.9.10	May 21, 2016
1.8.10	May 17, 2016
1.8.9	May 15, 2016
0.8.9	May 13, 2016
0.7.9	May 13, 2016
0.6.9	May 13, 2016
0.5.9	May 7, 2016
0.5.8	May 7, 2016
0.5.6	Apr 29, 2016
0.5.5	Apr 20, 2016
0.5.4	Apr 1, 2016
0.5.1	Mar 31, 2016
0.4.1	Mar 27, 2016
0.3	Mar 13, 2016
0.2	Mar 8, 2016
0.1	Mar 8, 2016

Download files

Source Distribution

featureflow-3.0.3.tar.gz (36.0 kB view details)

Uploaded Mar 4, 2020 Source

Built Distribution

featureflow-3.0.3-py3.7.egg (122.1 kB view details)

Uploaded Mar 2, 2020 Egg

File details

Details for the file featureflow-3.0.3.tar.gz.

File metadata

Download URL: featureflow-3.0.3.tar.gz
Upload date: Mar 4, 2020
Size: 36.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for featureflow-3.0.3.tar.gz
Algorithm	Hash digest
SHA256	`23624903672b611bb30be622eb46e058e4b37c8e4983ad6686f98cc4666997e8`
MD5	`952a83f5a67963a311c35d5fc8a2182e`
BLAKE2b-256	`5c867f0e83f59b92666dea3065eb8998df85ba00122e4554c353ee821fa67d64`

File details

Details for the file featureflow-3.0.3-py3.7.egg.

File metadata

Download URL: featureflow-3.0.3-py3.7.egg
Upload date: Mar 2, 2020
Size: 122.1 kB
Tags: Egg
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for featureflow-3.0.3-py3.7.egg
Algorithm	Hash digest
SHA256	`c7673e14320850e4ef5bb06e641bf22a732d66bf0c9eddded0c899fffc54b1c5`
MD5	`784c77ad4990323ae4abeb4baa7c2de6`
BLAKE2b-256	`666702d4cb857106345315cbb180730a4287459f4bfd68e3381e077ba2e7deb9`
