Data Filter
Quickly find flags (words, phrases, etc) within your data.
Data Filter is a lightweight data cleansing tool that can be easily extended to support different data structures or processing requirements. It natively supports the following:
- CSV files
- Text files
- Text strings
Requirements
- Python 3.6+
Installation
To install, simply use pipenv (or pip):
$ pipenv install datafilter
Basic Usage
CSV
from datafilter.filters import CSV
from datafilter.flags import Flag
words = Flag(tokens=["Lorem", "ipsum"])
phrases = Flag(tokens=["Volutpat est", "mi sit amet"])
data = CSV("test.csv", flags=[words, phrases])
Text
from datafilter.filters import Text
from datafilter.flags import Flag
words = Flag(tokens=["Lorem"])
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
data = Text(text, flags=[words])
Text File
from datafilter.filters import TextFile
from datafilter.flags import Flag
words = Flag(tokens=["Lorem", "ipsum"])
data = TextFile("test.txt", flags=[words], re_split=r"(?<=\.)")
Features
Data Filter was designed to be highly extensible. Common or useful flags and filters can be easily reused and shared. A few example use cases include:
- Flags that detect swear words, hate speech or unwanted names / phrases for a specific topic.
- Filters that can handle different data types such as Microsoft Word or Google Docs.
- Filters that can handle incoming data from external APIs.
Base
Abstract base class that's subclassed by Filter and Flag.
Base includes several methods to ensure data is properly normalized, formatted and returned. The results property is decorated with @abstractmethod, so every subclass must implement it.
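The abstract-property pattern described above can be sketched with the standard library's abc module. The class names below (SketchBase, SketchFlag, Incomplete) and the can_instantiate helper are hypothetical and are not part of datafilter; the sketch only illustrates how an abstract results property forces subclasses to define it.

```python
from abc import ABC, abstractmethod


class SketchBase(ABC):
    """Hypothetical stand-in for Base; not datafilter's actual code."""

    @property
    @abstractmethod
    def results(self):
        """Subclasses must provide their own results."""


class SketchFlag(SketchBase):
    """Hypothetical subclass that defines results, so it can be instantiated."""

    def __init__(self, tokens):
        self.tokens = tokens

    @property
    def results(self):
        # Return a generator, mirroring the generator-based API in this README.
        return (token for token in self.tokens)


class Incomplete(SketchBase):
    """Hypothetical subclass that forgets to define results."""


def can_instantiate(cls, *args):
    """Return True if cls(*args) succeeds, False if abc raises TypeError."""
    try:
        cls(*args)
    except TypeError:
        return False
    return True
```

Here can_instantiate(SketchFlag, ["Lorem"]) succeeds, while can_instantiate(Incomplete) fails because the abstract property was never overridden.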
Parameters
translations
type <list>
A list of strings that will be removed during normalization.
Default
['!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~', ' \t\n\r\x0b\x0c', '0123456789']
Note:
See Flag.TRANSLATIONS.
casesensitive
type <bool>
When False, tokens are converted to lowercase during normalization.
Default
False
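The three default translations strings listed above match constants in Python's string module, which may make a custom list easier to write. The sketch below checks that correspondence; PUNCTUATION_ONLY is a hypothetical example value, not something defined by datafilter.

```python
import string

# The documented default, copied from this README.
DOCUMENTED_DEFAULT = [
    '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~',
    ' \t\n\r\x0b\x0c',
    '0123456789',
]

# The same three strings, taken from the standard library.
FROM_STRING_MODULE = [string.punctuation, string.whitespace, string.digits]

# Hypothetical custom list: strip punctuation only, keeping spaces and digits.
PUNCTUATION_ONLY = [string.punctuation]
```

Since translations is a documented parameter, a list such as PUNCTUATION_ONLY could presumably be passed as translations=PUNCTUATION_ONLY; check the behavior against the library before relying on it.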
Methods
results
Abstract method used to return processed results. This is defined by Base subclasses.
normalize
A generator that yields normalized data. Normalization converts data to lowercase (unless casesensitive is True) and removes the strings listed in translations.
Yields
type <dict>
Note:
Normalized data is returned in the following key/value format. While the key will always be a string, the value may be a string, list, dictionary or boolean.
{ "original": "", "normalized": "", }
makelower
Returns lowercase data.
Returns
type <str>
maketrans
Returns a translation table used during normalization.
Returns
type <dict>
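The normalization steps described above can be approximated with str.maketrans and str.translate from the standard library. This is a minimal sketch under stated assumptions, not datafilter's implementation: the function names make_table and normalize are hypothetical, and it assumes every character in the translations strings is deleted.

```python
import string

TRANSLATIONS = [string.punctuation, string.whitespace, string.digits]


def make_table(translations):
    """Build a table that deletes every character in the given strings."""
    return str.maketrans("", "", "".join(translations))


def normalize(items, translations=TRANSLATIONS, casesensitive=False):
    """Yield one dict per item in the documented original/normalized shape."""
    table = make_table(translations)
    for item in items:
        # Lowercase first unless the caller asked for case-sensitive matching.
        text = item if casesensitive else item.lower()
        yield {"original": item, "normalized": text.translate(table)}


rows = list(normalize(["Lorem ipsum!", "Volutpat est 42"]))
```

Under these assumptions whitespace is removed as well, so "Lorem ipsum!" normalizes to "loremipsum".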
Flag
Flag contains a list of tokens that will be searched for within a set of data. By default, tokens are normalized and case insensitive. Multiple Flag objects can be added to a Filter.
Parameters
Flag is a subclass of Base and inherits all parameters.
tokens
type <list>
A list of strings that will be searched for within a set of data.
Methods
Flag is a subclass of Base and inherits all methods.
results
Property method that returns a generator that yields normalized flags.
Yields
type <dict>
Note:
See normalize for data format.
Filter
Abstract base class used to create filters.
Filters normalize, parse and format data. They accept one or more Flag objects and use them to flag rows of data when a token has been detected.
Filter includes several attributes and methods that ensure data is properly parsed and returned. It's meant to be subclassed so you can easily create and share filters that support different data types.
Parameters
Filter is a subclass of Base and inherits all parameters.
flags
type <list>
A list of Flag objects used to flag data.
bidirectional
type <bool>
When True, flags are matched in both directions, so a reversed token is also detected.
Default
True
Note:
A common method of obfuscation is to reverse the offending string or phrase. This helps detect that.
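The idea behind bidirectional matching can be illustrated with a plain substring check. The helper is_flagged below is hypothetical and only assumes that a reversed token counts as a match; the library's actual matching logic may differ.

```python
def is_flagged(text, token, bidirectional=True):
    """Return True if token, or its reverse when bidirectional, occurs in text."""
    haystack = text.lower()
    needle = token.lower()
    if needle in haystack:
        return True
    # needle[::-1] reverses the string, e.g. "lorem" becomes "merol".
    return bidirectional and needle[::-1] in haystack
```

With this sketch, is_flagged("meroL ipsum", "Lorem") matches the reversed token, while passing bidirectional=False for the same input does not.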
Methods
Filter is a subclass of Base and inherits all methods.
get_flags
A generator that yields normalized flags.
Yields
type <dict>
parse
Returns parsed and properly formatted data.
Returns
type <dict>
Example:
Assume we're searching for the token "Lorem" in a very short string.
words = Flag(tokens=["Lorem"])
data = Text("Lorem ipsum dolor sit amet", flags=[words])
print(next(data.results))
The returned result would be formatted as:
{
    "data": "Lorem ipsum dolor sit amet",
    "flagged": True,
    "describe": {
        "flags": {
            "detected": ["Lorem"],
            "count": 1,
            "frequency": {
                "Lorem": 1,
            },
        }
    },
}
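To make the result structure above concrete, the sketch below assembles a dictionary of the same shape from a string and a list of tokens. The describe helper is hypothetical, it matches case-sensitively without normalization, and it assumes count is the total number of occurrences across all detected tokens; the README does not say whether count means that or the number of distinct tokens.

```python
def describe(data, tokens):
    """Build a result dict in the documented shape (hypothetical helper)."""
    frequency = {}
    for token in tokens:
        # str.count gives the number of non-overlapping occurrences.
        hits = data.count(token)
        if hits:
            frequency[token] = hits
    return {
        "data": data,
        "flagged": bool(frequency),
        "describe": {
            "flags": {
                "detected": list(frequency),
                # Assumption: count is the sum of all token frequencies.
                "count": sum(frequency.values()),
                "frequency": frequency,
            }
        },
    }


result = describe("Lorem ipsum dolor sit amet", ["Lorem"])
```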
CSV
Parameters
CSV is a subclass of Filter and inherits all parameters.
path
type <str>
Path to a CSV file.
Methods
CSV is a subclass of Filter and inherits all methods.
read_csv
Static method that accepts a parameter, stream, of type TextIO and returns a generator that yields a list of CSV rows.
Yields
type <list>
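A generator over a text stream, as described above, can be sketched with the standard csv module. The function name read_rows is hypothetical, and the sketch assumes each yielded item is one row represented as a list of strings; datafilter's read_csv may batch or shape rows differently.

```python
import csv
import io
from typing import Iterator, List, TextIO


def read_rows(stream: TextIO) -> Iterator[List[str]]:
    """Yield each CSV row in stream as a list of strings."""
    for row in csv.reader(stream):
        yield row


# io.StringIO stands in for an open file so the sketch needs no file on disk.
sample = io.StringIO('id,text\n1,"Lorem ipsum, dolor"\n')
rows = list(read_rows(sample))
```

Because the function only needs a TextIO object, the same code works with a file opened via open(path, newline="").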
Text
Parameters
Text is a subclass of Filter and inherits all parameters.
text
type <str>
A text string.
re_split
type <str>
A regular expression pattern or string that will be applied to text with re.split before normalization.
Methods
Text is a subclass of Filter and inherits all methods.
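The effect of re_split can be previewed with re.split directly, using the lookbehind pattern from the Text File example in Basic Usage. The helper split_text is hypothetical; stripping whitespace and dropping empty pieces are assumptions of this sketch, not documented library behavior. Note that re.split only splits on zero-width matches such as this lookbehind from Python 3.7 onward.

```python
import re


def split_text(text, re_split=None):
    """Split text with re.split, dropping empty pieces and outer whitespace."""
    if re_split is None:
        return [text]
    pieces = re.split(re_split, text)
    # Assumption: empty pieces and surrounding whitespace are not wanted.
    return [piece.strip() for piece in pieces if piece.strip()]


# (?<=\.) matches the empty position right after each full stop.
sentences = split_text("Lorem ipsum. Dolor sit amet. Consectetur.", r"(?<=\.)")
```

Each sentence then becomes its own row of data, so a flag is reported against the sentence that contains it instead of the whole text.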
TextFile
Parameters
TextFile is a subclass of Filter and inherits all parameters.
path
type <str>
Path to a text file.
re_split
type <str>
A regular expression pattern or string that will be applied to text with re.split before normalization.
Methods
TextFile is a subclass of Filter and inherits all methods.