Quickly find tokens (words, phrases, etc) within your data.

Development Status
- 5 - Production/Stable
Intended Audience
- Developers
License
- OSI Approved :: BSD License
Operating System
- OS Independent

Project description

Data Filter

Data Filter is a lightweight data cleansing framework that can be easily extended to support different data types, structures or processing requirements. It natively supports the following data types:

CSV files
Text files
Text strings

Requirements
Installation
Basic Usage
Features
- Base
- Filters
  - CSV
  - Text
  - TextFile

Requirements

Python 3.6+

Installation

To install, use pipenv (or pip):

$ pipenv install datafilter

Basic Usage

CSV

from datafilter import CSV

tokens = ["Lorem", "ipsum", "Volutpat est", "mi sit amet"]
data = CSV("test.csv", tokens=tokens)
data.save("filtered.csv")

In this example, we open a CSV file, search all rows for normalized tokens and flag them. The save method creates a new CSV file with all rows that weren't flagged.
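
The flag-and-save flow described above can be approximated with the standard library alone. The sketch below is not Data Filter's implementation: `keep_unflagged_rows` is a hypothetical helper, and a plain case-insensitive substring check stands in for the library's normalization.

```python
import csv
import io


def keep_unflagged_rows(csv_text, tokens):
    """Return CSV text containing only rows in which no token appears.

    Hypothetical helper: a case-insensitive substring check stands in
    for Data Filter's normalization step.
    """
    lowered = [token.lower() for token in tokens]
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        joined = " ".join(row).lower()
        # A row is "flagged" as soon as any token occurs in it.
        if not any(token in joined for token in lowered):
            writer.writerow(row)
    return output.getvalue()


sample = "id,comment\n1,Lorem ipsum dolor\n2,Nothing to see here\n"
filtered = keep_unflagged_rows(sample, ["Lorem", "mi sit amet"])
print(filtered)
```

Only the header and the second row survive, because the first data row contains "Lorem".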

Text

from datafilter import Text

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
data = Text(text, tokens=["Lorem"])
print(next(data.results()))

In this example, we search a text string for normalized tokens. We can then iterate over the results using the .results() method, which returns a generator that yields formatted results.

Text File

from datafilter import TextFile

data = TextFile("test.txt", tokens=["Lorem", "ipsum"], re_split=r"(?<=\.)")
print(next(data.results()))

In this example, we open a text file and split the data based on a regular expression defined by re_split.
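
To see what the pattern in this example does on its own, here is a small sketch that calls re.split directly. The sample text is made up, and the sketch assumes Python 3.7 or later, where re.split can split on zero-width matches such as a lookbehind.

```python
import re

text = "Lorem ipsum dolor. Sit amet. Consectetur"

# The lookbehind (?<=\.) matches the empty position right after each period,
# so every period stays attached to the chunk that precedes it.
chunks = re.split(r"(?<=\.)", text)
print(chunks)
```

Each chunk ends with its period, and the chunks after the first keep their leading space.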

Features

Data Filter was designed to be highly extensible. Common or useful filters can be easily reused and shared. A few example use cases include:

Filters that can handle different data types such as Microsoft Word, Google Docs, etc.
Filters that can handle incoming data from external APIs.

Base

Abstract base class that's subclassed by every filter.

Base includes several methods to ensure data is properly normalized, formatted and returned. The .results() method is an @abstractmethod, so every subclass must implement it.
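
As a rough illustration of the abstract-method pattern, the sketch below uses stand-in classes. `SketchBase` and `WordFilter` are hypothetical names and do not mirror the library's internals. It only shows that Python refuses to instantiate a class whose abstract method is not implemented.

```python
from abc import ABC, abstractmethod


class SketchBase(ABC):
    """Hypothetical stand-in for Base; not the library's class."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    @abstractmethod
    def results(self):
        """Yield one result per piece of data."""


class WordFilter(SketchBase):
    """Hypothetical subclass that checks whitespace-separated words."""

    def __init__(self, text, tokens):
        super().__init__(tokens)
        self.text = text

    def results(self):
        for word in self.text.split():
            yield {"data": word, "flagged": word in self.tokens}


try:
    SketchBase(tokens=["Lorem"])  # abstract, so this raises TypeError
    instantiated = True
except TypeError:
    instantiated = False

first = next(WordFilter("Lorem ipsum", tokens=["Lorem"]).results())
print(instantiated, first)
```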

Parameters

`tokens`

type <list>

A list of strings that will be searched for within a set of data.

`translations`

type <list>

A list of strings that will be removed during normalization.

Default

['0123456789', '(){}[]<>!?.:;,`\'"@#$%^&*+-|=~–—/\\_', '\t\n\r\x0c\x0b']

`bidirectional`

type <bool>

When True, token matching will be bidirectional.

Default

True

Note:

A common obfuscation method is to reverse the offending string or phrase. This helps detect those instances.
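
A minimal sketch of the idea, assuming that bidirectional matching means "also look for each token spelled backwards". `find_tokens` is a hypothetical helper, not part of Data Filter, and it uses a simple case-insensitive substring check.

```python
def find_tokens(text, tokens, bidirectional=True):
    """Return the tokens found in text, optionally also matching them reversed.

    Hypothetical helper; the library's own matching may differ.
    """
    haystack = text.lower()
    found = []
    for token in tokens:
        needle = token.lower()
        # With bidirectional matching, the reversed spelling counts as a hit too.
        candidates = {needle, needle[::-1]} if bidirectional else {needle}
        if any(candidate in haystack for candidate in candidates):
            found.append(token)
    return found


reversed_hit = find_tokens("meroL ipsum dolor", ["Lorem"])
forward_only = find_tokens("meroL ipsum dolor", ["Lorem"], bidirectional=False)
print(reversed_hit, forward_only)
```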

`caseinsensitive`

type <bool>

When True, tokens and data are converted to lowercase during normalization.

Default

True

Methods

`.results()`

Abstract method used to return results within a filter. It must be implemented by each Base subclass.

`.maketrans()`

Returns a translation table used during normalization.

Returns

type <dict>
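
For readers unfamiliar with translation tables, here is a sketch built on Python's built-in str.maketrans. Whether Data Filter builds its table exactly this way is an assumption, and the translations list is shortened for the example.

```python
# Shortened stand-in for the default translations list.
translations = ["0123456789", "!?.:;,", "\t\n\r"]

# With two empty strings and a third argument, str.maketrans maps every
# character of the third argument to None, which means "delete it".
table = str.maketrans("", "", "".join(translations))

cleaned = "L0r.em\tip!sum".translate(table)
print(type(table).__name__, cleaned)
```

The table is a plain dict keyed by character code points, which matches the documented return type.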

`.normalize(data)`

Returns normalized data. Normalization removes the strings listed in translations and, when caseinsensitive is True, converts the data to lowercase.

Accepts parameter data.

Returns

type <tuple>

Note:

Normalized data is returned as a tuple. The first element is the original data. The second element is the normalized data.
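
Returning both forms lets a filter match against the normalized text while still reporting the original. A minimal sketch under that reading follows. `normalize_sketch` is a hypothetical function with a shortened translations default, not the library's method.

```python
def normalize_sketch(data, translations=("0123456789", "!?.:;,"), caseinsensitive=True):
    """Return (original, normalized); hypothetical stand-in for .normalize()."""
    table = str.maketrans("", "", "".join(translations))
    normalized = data.translate(table)
    if caseinsensitive:
        normalized = normalized.lower()
    # The original is passed through untouched as the first element.
    return data, normalized


original, normalized = normalize_sketch("L0rem, Ipsum!")
print(original)
print(normalized)
```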

`.parse(data)`

Returns parsed and formatted data.

Accepts parameter data.

Returns

type <dict>

Example

Assume we're searching for the token "Lorem" in a very short text string.

data = Text("Lorem ipsum dolor sit amet", tokens=["Lorem"])
print(next(data.results()))

The returned result would be formatted as:

{
    "data": "Lorem ipsum dolor sit amet",
    "flagged": True,
    "describe": {
        "tokens": {
            "detected": ["Lorem"],
            "count": 1,
            "frequency": {
                "Lorem": 1,
            },
        }
    },
}

Note:

.parse() should never be called directly. Use .results() instead.
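
To make the shape of a result concrete, the sketch below assembles the same dictionary by hand. `build_result` is a hypothetical helper. It assumes that "count" is the number of distinct detected tokens and that "frequency" counts non-overlapping, case-insensitive occurrences. The library may define both differently.

```python
from collections import Counter


def build_result(data, tokens):
    """Build a result dict shaped like the documented example (hypothetical)."""
    haystack = data.lower()
    frequency = Counter()
    for token in tokens:
        occurrences = haystack.count(token.lower())
        if occurrences:
            frequency[token] = occurrences
    return {
        "data": data,
        "flagged": bool(frequency),
        "describe": {
            "tokens": {
                "detected": list(frequency),
                # Assumption: count is the number of distinct detected tokens.
                "count": len(frequency),
                "frequency": dict(frequency),
            }
        },
    }


result = build_result("Lorem ipsum dolor sit amet", tokens=["Lorem"])
print(result)
```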

Filters

Filters subclass and extend the Base class to support various data types and structures. This extensibility allows for the creation of powerful custom filters specifically tailored to a given task, data type or structure.

CSV

Parameters

CSV is a subclass of Base and inherits all parameters.

`path`

type <str>

Path to a CSV file.

Methods

CSV is a subclass of Base and inherits all methods.

`.save(path)`

Saves results to a file.

Accepts parameter path. path is the absolute path and filename of the new file.

Text

Parameters

Text is a subclass of Base and inherits all parameters.

`text`

type <str>

A text string.

`re_split`

type <str>

A regular expression pattern or string that will be applied to text with re.split before normalization.

Methods

Text is a subclass of Base and inherits all methods.

`.save(path, endofline=" ")`

Saves results to a file.

Accepts parameters path and endofline. path is the absolute path and filename of the new file. endofline is a line delimiter that will be added to the end of every row.

TextFile

Parameters

TextFile is a subclass of Base and inherits all parameters.

`path`

type <str>

Path to a text file.

`re_split`

type <str>

A regular expression pattern or string that will be applied to text with re.split before normalization.

Methods

TextFile is a subclass of Base and inherits all methods.

`.save(path, endofline=" ")`

Saves results to a file.

Accepts parameters path and endofline. path is the absolute path and filename of the new file. endofline is a line delimiter that will be added to the end of every row.

Release history

- 0.4.2 (Aug 20, 2019), this version
- 0.4.1 (Aug 14, 2019)
- 0.3.0 (Aug 13, 2019)
- 0.2.0 (Aug 12, 2019)
- 0.1.2 (Aug 11, 2019)

Download files

Source Distribution

datafilter-0.4.2.tar.gz (7.0 kB), uploaded Aug 20, 2019

- SHA256: `32588f7a941653f27b5d0ed195a3bcc34f0e5d8465b5c5efb2fd5d2273085110`
- MD5: `d3d7cd39ca78ac6dc9076f6d9f43d036`
- BLAKE2b-256: `f841c512c54c51acc2b6bb7cc6e5842136cf7a38fa819ead5d02bed74b538fe9`

Built Distribution

datafilter-0.4.2-py2.py3-none-any.whl (7.6 kB), uploaded Aug 20, 2019, tagged Python 2 and Python 3

- SHA256: `c44cda907badbdd3151e75968f487f8c581db26c2349820fe497517a23ef0a49`
- MD5: `662da2bb309338c58d997955f98594bf`
- BLAKE2b-256: `116fd74a3291e26831a4148108376f50d6df2dde9550448733d0cb33022d8bed`
