Data Filter

Quickly find flags (words, phrases, etc) within your data.

Data Filter is a lightweight data cleansing tool that can be easily extended to support different data structures or processing requirements. It natively supports the following:

  • CSV files
  • Text files
  • Text strings

Requirements

  • Python 3.6+

Installation

To install, simply use pipenv (or pip):

$ pipenv install datafilter

Basic Usage

CSV

from datafilter.filters import CSV
from datafilter.flags import Flag

words = Flag(tokens=["Lorem", "ipsum"])
phrases = Flag(tokens=["Volutpat est", "mi sit amet"])
data = CSV("test.csv", flags=[words, phrases])

Text

from datafilter.filters import Text
from datafilter.flags import Flag

words = Flag(tokens=["Lorem"])
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
data = Text(text, flags=[words])

Text File

from datafilter.filters import TextFile
from datafilter.flags import Flag

words = Flag(tokens=["Lorem", "ipsum"])
data = TextFile("test.txt", flags=[words], re_split=r"(?<=\.)")
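The re_split pattern in the Text File example is a zero-width lookbehind that matches the position right after a period. The sketch below uses only the standard library re module to show how such a pattern divides text into sentence-like chunks. It assumes the filter hands the pattern to re.split unchanged, as the re_split parameter description suggests; the sample text and the filtering of empty chunks are illustrative choices, not the library's behavior. Splitting on empty matches requires Python 3.7 or newer.

```python
import re

# Same pattern as the TextFile example: an empty match right after a period.
PATTERN = r"(?<=\.)"

sample = "Lorem ipsum. Dolor sit amet. Consectetur."

# Because the match is empty, each period stays with the chunk before it.
# The final match at the end of the string produces an empty chunk, dropped here.
chunks = [chunk for chunk in re.split(PATTERN, sample) if chunk]

print(chunks)
# ['Lorem ipsum.', ' Dolor sit amet.', ' Consectetur.']
```

Each chunk keeps its leading space, so normalization (which removes whitespace by default) would be what cleans that up.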

Features

Data Filter was designed to be highly extensible. Common or useful flags and filters can be easily reused and shared. A few example use cases include:

  • Flags that detect swear words, hate speech or unwanted names / phrases for a specific topic.
  • Filters that can handle different data types such as Microsoft Word or Google Docs.
  • Filters that can handle incoming data from external APIs.

Base

Abstract base class that's subclassed by Filter and Flag.

Base includes several methods to ensure data is properly normalized, formatted and returned. The results property is declared with @abstractmethod, so every subclass must implement it.
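As a rough illustration of what an abstract results property enforces, the sketch below uses the standard library abc module. SketchBase and SketchFlag are hypothetical stand-ins written for this example; they are not the library's classes and only mirror the behavior described above.

```python
from abc import ABC, abstractmethod


class SketchBase(ABC):
    """Hypothetical stand-in for Base; not the library's code."""

    @property
    @abstractmethod
    def results(self):
        """Subclasses must provide this property."""


class SketchFlag(SketchBase):
    """Hypothetical subclass that implements the required property."""

    def __init__(self, tokens):
        self.tokens = tokens

    @property
    def results(self):
        # Return a generator, mirroring the description of Flag.results.
        return (token for token in self.tokens)


try:
    SketchBase()  # fails: the abstract property has no implementation
except TypeError as error:
    print("cannot instantiate:", type(error).__name__)

print(list(SketchFlag(["Lorem", "ipsum"]).results))
```

A subclass that forgets to define results fails at instantiation time in the same way, which is the point of marking the property abstract.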

Parameters

translations

type <list>

A list of strings that will be removed during normalization.

Default

['!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~', ' \t\n\r\x0b\x0c', '0123456789']

Note:

See Flag.TRANSLATIONS.
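The three default strings are the same values as the standard library constants string.punctuation, string.whitespace and string.digits. The short check below confirms that; whether the library builds its default from those constants is an assumption, since only the literal values are documented here.

```python
import string

# The documented default, copied from the Default entry above.
documented_default = [
    '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~',
    ' \t\n\r\x0b\x0c',
    '0123456789',
]

# The same three strings, taken from the standard library.
stdlib_equivalent = [string.punctuation, string.whitespace, string.digits]

print(documented_default == stdlib_equivalent)
# True
```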

casesensitive

type <bool>

When False, tokens are converted to lowercase during normalization.

Default

False

Methods

results

Abstract method used to return processed results. This is defined by Base subclasses.

normalize

A generator that yields normalized data. Normalization converts data to lowercase (unless casesensitive is True) and removes the strings listed in translations.

Yields

type <dict>

Note:

Normalized data is returned in the following key/value format. While the key will always be a string, the value may be a string, list, dictionary or boolean.

{
    "original": "",
    "normalized": "",
}
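A minimal sketch of a generator that produces this format is shown below. The function name normalize_sketch is hypothetical, and the sketch assumes that the translations strings are removed character by character through a translation table; the library's actual implementation may differ.

```python
import string


def normalize_sketch(items, translations=None, casesensitive=False):
    """Yield {"original", "normalized"} dicts; illustrative only."""
    if translations is None:
        translations = [string.punctuation, string.whitespace, string.digits]
    # Map every character in the translation strings to None (delete it).
    table = str.maketrans("", "", "".join(translations))
    for item in items:
        normalized = item if casesensitive else item.lower()
        yield {"original": item, "normalized": normalized.translate(table)}


for row in normalize_sketch(["Lorem ipsum!", "Mi sit amet 42"]):
    print(row)
# {'original': 'Lorem ipsum!', 'normalized': 'loremipsum'}
# {'original': 'Mi sit amet 42', 'normalized': 'misitamet'}
```

Keeping the original value next to the normalized one lets a caller match against the cleaned text while still reporting the untouched input.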

makelower

Returns lowercase data.

Returns

type <str>

maketrans

Returns a translation table used during normalization.

Returns

type <dict>
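For readers unfamiliar with translation tables, the example below shows what the built-in str.maketrans returns when it is asked to delete characters. It is a sketch of the general technique, assuming the library's maketrans wraps something similar; the sample string is made up.

```python
import string

# Three-argument form: every character in the third argument maps to None,
# which tells str.translate to delete it.
table = str.maketrans("", "", string.digits + string.punctuation)

print(type(table).__name__)                   # dict
print(table[ord("!")])                        # None
print("Lorem-ipsum-123!".translate(table))    # Loremipsum
```

The table is a plain dict keyed by Unicode code points, which matches the documented return type.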

Flag

Flag contains a list of tokens that will be searched for within a set of data. By default, tokens are normalized and case insensitive. Multiple Flag objects can be added to a Filter.

Parameters

Flag is a subclass of Base and inherits all parameters.

tokens

type <list>

A list of strings that will be searched for within a set of data.

Methods

Flag is a subclass of Base and inherits all methods.

results

Property method that returns a generator that yields normalized flags.

Yields

type <dict>

Note:

See normalize for data format.

Filter

Abstract base class used to create filters.

Filters normalize, parse and format data. They accept one or more Flag objects and use them to flag rows of data when a token has been detected.

Filter includes several attributes and methods that ensure data is properly parsed and returned. It's meant to be subclassed so you can easily create and share filters that support different data types.

Parameters

Filter is a subclass of Base and inherits all parameters.

flags

type <list>

A list of Flag objects used to flag data.

bidirectional

type <bool>

When True, flags are matched against the data in both directions (forward and reversed).

Default

True

Note:

A common method of obfuscation is to reverse the offending string or phrase. Bidirectional matching helps detect such reversed tokens.
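The idea can be sketched with plain string operations. The helper contains_token below is hypothetical and works on already-normalized text; it only illustrates what checking both directions means and is not the library's matching logic.

```python
def contains_token(text, token, bidirectional=True):
    """Return True if token occurs in text, optionally also in reversed text."""
    if token in text:
        return True
    # Reverse the text once and look for the token again.
    return bidirectional and token in text[::-1]


print(contains_token("sit merol amet", "lorem"))         # True
print(contains_token("sit merol amet", "lorem", False))  # False
```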

Methods

Filter is a subclass of Base and inherits all methods.

get_flags

A generator that yields normalized flags.

Yields

type <dict>

parse

Returns parsed and properly formatted data.

Returns

type <dict>

Example:

Assume we're searching for the token "Lorem" in a very short string.

from datafilter.filters import Text
from datafilter.flags import Flag

words = Flag(tokens=["Lorem"])
data = Text("Lorem ipsum dolor sit amet", flags=[words])
print(next(data.results))

The returned result would be formatted as:

{
    "data": "Lorem ipsum dolor sit amet",
    "flagged": True,
    "describe": {
        "flags": {
            "detected": ["Lorem"],
            "count": 1,
            "frequency": {
                "Lorem": 1,
            },
        }
    },
}
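To make the shape of this result easier to reason about, here is a standalone sketch that builds a dictionary with the same keys. The function describe_sketch is hypothetical. It assumes case-insensitive substring matching and treats count as the total number of occurrences, neither of which is confirmed by the documentation above.

```python
from collections import Counter


def describe_sketch(data, tokens):
    """Build a result dict shaped like the example above; illustrative only."""
    lowered = data.lower()
    frequency = Counter()
    for token in tokens:
        # Count non-overlapping, case-insensitive occurrences of the token.
        hits = lowered.count(token.lower())
        if hits:
            frequency[token] = hits
    return {
        "data": data,
        "flagged": bool(frequency),
        "describe": {
            "flags": {
                "detected": list(frequency),
                "count": sum(frequency.values()),
                "frequency": dict(frequency),
            }
        },
    }


result = describe_sketch("Lorem ipsum dolor sit amet", ["Lorem"])
print(result["flagged"], result["describe"]["flags"]["frequency"])
# True {'Lorem': 1}
```

A caller would typically iterate over results and keep only the rows where flagged is True.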

CSV

Parameters

CSV is a subclass of Filter and inherits all parameters.

path

type <str>

Path to a CSV file.

Methods

CSV is a subclass of Filter and inherits all methods.

read_csv

Static method that accepts a stream parameter of type TextIO and returns a generator that yields a list of CSV rows.

Yields

type <list>
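A generator with this kind of signature can be sketched with the standard library csv module. The function read_csv_sketch is hypothetical and yields one list per row; the in-memory stream and its contents are made up so the example runs without a file.

```python
import csv
import io
from typing import Iterator, List, TextIO


def read_csv_sketch(stream: TextIO) -> Iterator[List[str]]:
    """Yield each CSV row from an open text stream as a list of strings."""
    yield from csv.reader(stream)


stream = io.StringIO("id,comment\n1,Lorem ipsum\n2,dolor sit amet\n")
for row in read_csv_sketch(stream):
    print(row)
# ['id', 'comment']
# ['1', 'Lorem ipsum']
# ['2', 'dolor sit amet']
```

Accepting a stream rather than a path keeps the reader easy to test and lets it work with any file-like object.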

Text

Parameters

Text is a subclass of Filter and inherits all parameters.

text

type <str>

A text string.

re_split

type <str>

A regular expression pattern or string that will be applied to text with re.split before normalization.

Methods

Text is a subclass of Filter and inherits all methods.

TextFile

Parameters

TextFile is a subclass of Filter and inherits all parameters.

path

type <str>

Path to a text file.

re_split

type <str>

A regular expression pattern or string that will be applied to text with re.split before normalization.

Methods

TextFile is a subclass of Filter and inherits all methods.
