
Data Filter


Quickly find tokens (words, phrases, etc.) within your data.

Data Filter is a lightweight data cleansing framework that can be easily extended to support different data types, structures or processing requirements. It natively supports the following data types:

  • CSV files
  • Text files
  • Text strings


Requirements

  • Python 3.6+

Installation

To install, simply use pipenv (or pip):

$ pipenv install datafilter

Basic Usage

Each example below returns a generator that yields parsed data.

CSV

from datafilter import CSV

tokens = ["Lorem", "ipsum", "Volutpat est", "mi sit amet"]
data = CSV("test.csv", tokens=tokens)
print(next(data.results))

Text

from datafilter import Text

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
data = Text(text, tokens=["Lorem"])
print(next(data.results))

Text File

from datafilter import TextFile

data = TextFile("test.txt", tokens=["Lorem", "ipsum"], re_split=r"(?<=\.)")
print(next(data.results))

Features

Data Filter was designed to be highly extensible. Common or useful filters can be easily reused and shared. A few example use cases include:

  • Filters that can handle different data types such as Microsoft Word, Google Docs, etc.
  • Filters that can handle incoming data from external APIs.

Base

Abstract base class that's subclassed by every filter.

Base includes several methods to ensure data is properly normalized, formatted and returned. The results property method is an @abstractmethod to enforce its use in subclasses.

Parameters

tokens

type <list>

A list of strings that will be searched for within a set of data.

translations

type <list>

A list of strings that will be removed during normalization.

Default

['!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~', ' \t\n\r\x0b\x0c', '0123456789']
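These defaults appear to match Python's built-in `string.punctuation`, `string.whitespace`, and `string.digits` constants. The sketch below shows how such a list can be turned into a removal table with the standard library; it illustrates the idea, not necessarily datafilter's internal implementation:

```python
import string

# The default translations correspond to the standard library constants
# string.punctuation, string.whitespace, and string.digits.
translations = [string.punctuation, string.whitespace, string.digits]

# Build one table that maps every character in those strings to None,
# i.e. deletes it during normalization.
table = str.maketrans("", "", "".join(translations))

print("Lorem ipsum, dolor sit amet 123!".translate(table))
# → Loremipsumdolorsitamet
```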

bidirectional

type <bool>

When True, token matching will be bidirectional.

Default

True

Note:

A common obfuscation method is to reverse the offending string or phrase. This helps detect those instances.
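A hand-rolled illustration of the idea (not datafilter's internal code): each token is searched for both as written and reversed.

```python
tokens = ["lorem"]
data = "ipsum merol dolor"  # "merol" is "lorem" reversed

# Bidirectional matching checks each token as-is and reversed,
# catching simple reversal-based obfuscation.
detected = [t for t in tokens if t in data or t[::-1] in data]
print(detected)  # ['lorem'] — matched via its reversed form "merol"
```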

caseinsensitive

type <bool>

When True, tokens and data are converted to lowercase during normalization.

Default

True

Methods

results

Abstract method used to return processed results. This is defined within Base subclasses.

makelower

Returns lowercase data.

Returns

type <str>

maketrans

Returns a translation table used during normalization.

Returns

type <dict>

normalize

A generator that yields normalized data. Normalization includes converting data to lowercase and removing strings.

Yields

type <dict>

Note:

Normalized data is returned in the following key/value format. While the key will always be a string, the value may be a string, list, dictionary or boolean.

{
    "original": "",
    "normalized": "",
}

parse

Returns parsed and properly formatted data.

Returns

type <dict>

Example:

Assume we're searching for the token "Lorem" in a very short string.

data = Text("Lorem ipsum dolor sit amet", tokens=["Lorem"])
print(next(data.results))

The returned result would be formatted as:

{
    "data": "Lorem ipsum dolor sit amet",
    "flagged": True,
    "describe": {
        "tokens": {
            "detected": ["Lorem"],
            "count": 1,
            "frequency": {
                "Lorem": 1,
            },
        }
    },
}

Filters

Filters subclass and extend the Base class to support various data types and structures. This extensibility allows for the creation of powerful custom filters specifically tailored to a given task, data type or structure.
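The pattern a custom filter would follow can be sketched as below. The `Base` stand-in and the `LineFilter` name are illustrative only, mirroring the interface described above rather than the library's actual code:

```python
from abc import ABC, abstractmethod

# Simplified stand-in for datafilter's Base, for illustration only.
class Base(ABC):
    def __init__(self, tokens, caseinsensitive=True):
        self.caseinsensitive = caseinsensitive
        self.tokens = [t.lower() for t in tokens] if caseinsensitive else tokens

    @property
    @abstractmethod
    def results(self):
        """Subclasses must yield processed results."""

# A hypothetical custom filter that processes an iterable of lines.
class LineFilter(Base):
    def __init__(self, lines, **kwargs):
        super().__init__(**kwargs)
        self.lines = lines

    @property
    def results(self):
        for line in self.lines:
            haystack = line.lower() if self.caseinsensitive else line
            detected = [t for t in self.tokens if t in haystack]
            yield {"data": line, "flagged": bool(detected)}

data = LineFilter(["Lorem ipsum", "dolor sit"], tokens=["Lorem"])
print(next(data.results))  # {'data': 'Lorem ipsum', 'flagged': True}
```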

CSV

Parameters

CSV is a subclass of Base and inherits all parameters.

path

type <str>

Path to a CSV file.

Methods

CSV is a subclass of Base and inherits all methods.

read_csv

Static method that accepts a stream parameter of type TextIO and returns a generator that yields each CSV row as a list.

Yields

type <list>
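A minimal equivalent using the standard library's csv module shows the shape of what is yielded; datafilter's implementation may differ in detail:

```python
import csv
import io

# Accepts a TextIO stream and yields each CSV row as a list of strings.
def read_csv(stream):
    yield from csv.reader(stream)

stream = io.StringIO("Lorem,ipsum\ndolor,sit\n")
print(next(read_csv(stream)))  # ['Lorem', 'ipsum']
```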

Text

Parameters

Text is a subclass of Base and inherits all parameters.

text

type <str>

A text string.

re_split

type <str>

A regular expression pattern or string that will be applied to text with re.split before normalization.
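For example, the lookbehind pattern `r"(?<=\.)"` from the Basic Usage section splits after each period without consuming it, so each sentence keeps its trailing period. A standalone sketch with the standard library:

```python
import re

text = "Lorem ipsum. Dolor sit amet. Consectetur adipiscing."

# r"(?<=\.)" is a zero-width lookbehind: split points fall immediately
# after each ".", leaving the period attached to its sentence.
chunks = [c.strip() for c in re.split(r"(?<=\.)", text) if c.strip()]
print(chunks)
# → ['Lorem ipsum.', 'Dolor sit amet.', 'Consectetur adipiscing.']
```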

Methods

Text is a subclass of Base and inherits all methods.

TextFile

Parameters

TextFile is a subclass of Base and inherits all parameters.

path

type <str>

Path to a text file.

re_split

type <str>

A regular expression pattern or string that will be applied to text with re.split before normalization.

Methods

TextFile is a subclass of Base and inherits all methods.
