Data Filter
Quickly find tokens (words, phrases, etc) within your data.
Data Filter is a lightweight data cleansing framework that can be easily extended to support different data types, structures or processing requirements. It natively supports the following data types:
- CSV files
- Text files
- Text strings
Requirements
- Python 3.6+
Installation
To install, simply use pipenv (or pip):
$ pipenv install datafilter
Basic Usage
Each example below returns a generator that yields parsed data.
CSV
from datafilter import CSV
tokens = ["Lorem", "ipsum", "Volutpat est", "mi sit amet"]
data = CSV("test.csv", tokens=tokens)
print(next(data.results))
Text
from datafilter import Text
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
data = Text(text, tokens=["Lorem"])
print(next(data.results))
Text File
from datafilter import TextFile
data = TextFile("test.txt", tokens=["Lorem", "ipsum"], re_split=r"(?<=\.)")
print(next(data.results))
Features
Data Filter was designed to be highly extensible. Common or useful filters can be easily reused and shared. A few example use cases include:
- Filters that can handle different data types such as Microsoft Word, Google Docs, etc.
- Filters that can handle incoming data from external APIs.
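As a rough illustration of the subclassing pattern described above, here is a hedged sketch of a hypothetical filter for records fetched from an external API. The `Base` stand-in and the `APIFilter` class, its constructor signature and its result format are simplified assumptions for illustration, not the library's actual API.

```python
from abc import ABC, abstractmethod

# Simplified stand-in for datafilter's Base; the real class also handles
# normalization, translations and case folding.
class Base(ABC):
    def __init__(self, tokens):
        self.tokens = tokens

    @property
    @abstractmethod
    def results(self):
        """Generator yielding parsed results; implemented by subclasses."""

# Hypothetical filter for records already fetched from an external API.
class APIFilter(Base):
    def __init__(self, records, tokens):
        super().__init__(tokens)
        self.records = records

    @property
    def results(self):
        # Flag each record that contains any of the tokens.
        return (
            {"data": r, "flagged": any(t in r for t in self.tokens)}
            for r in self.records
        )

data = APIFilter(["Lorem ipsum", "dolor sit amet"], tokens=["Lorem"])
print(next(data.results))  # {'data': 'Lorem ipsum', 'flagged': True}
```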
Base
Abstract base class that's subclassed by every filter. Base includes several methods to ensure data is properly normalized, formatted and returned. The results property method is an @abstractmethod to enforce its use in subclasses.
Parameters
tokens
type: <list>
A list of strings that will be searched for within a set of data.
translations
type: <list>
A list of strings that will be removed during normalization.
Default: ['!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~', ' \t\n\r\x0b\x0c', '0123456789']
bidirectional
type: <bool>
When True, token matching will be bidirectional.
Default: True
Note: A common obfuscation method is to reverse the offending string or phrase. This helps detect those instances.
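In plain Python, bidirectional matching amounts to scanning both the data and its reverse for each token. A minimal sketch (not the package's internal code):

```python
def bidirectional_match(data, tokens):
    """Return the tokens found in data, scanning it forward and reversed."""
    reversed_data = data[::-1]
    return [t for t in tokens if t in data or t in reversed_data]

# "Lorem" and "dolor" only appear once the obfuscated string is reversed.
print(bidirectional_match("tema tis rolod muspi meroL", ["Lorem", "dolor"]))
# ['Lorem', 'dolor']
```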
caseinsensitive
type: <bool>
When True, tokens and data are converted to lowercase during normalization.
Default: True
Methods
results
Abstract method used to return processed results. This is defined within Base subclasses.
makelower
Returns lowercase data.
Returns: <str>
maketrans
Returns a translation table used during normalization.
Returns: <dict>
normalize
A generator that yields normalized data. Normalization includes converting data to lowercase and removing the strings listed in translations.
Yields: <dict>
Note: Normalized data is returned in the following key/value format. While the key will always be a string, the value may be a string, list, dictionary or boolean.
{
    "original": "",
    "normalized": "",
}
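For illustration, a normalization step matching the documented defaults (lowercasing, plus stripping punctuation, whitespace and digits) could be sketched as follows. This is an assumed reimplementation, not the package's own normalize:

```python
import string

# str.maketrans("", "", chars) builds a table mapping each char to None,
# mirroring the documented default translations list.
TABLE = str.maketrans("", "", string.punctuation + string.whitespace + string.digits)

def normalize(items):
    """Yield each original string alongside its normalized form."""
    for original in items:
        yield {"original": original, "normalized": original.lower().translate(TABLE)}

print(next(normalize(["Lorem, ipsum 123!"])))
# {'original': 'Lorem, ipsum 123!', 'normalized': 'loremipsum'}
```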
parse
Returns parsed and properly formatted data.
Returns: <dict>
Example:
Assume we're searching for the token "Lorem" in a very short string.
data = Text("Lorem ipsum dolor sit amet", tokens=["Lorem"])
print(next(data.results))
The returned result would be formatted as:
{
    "data": "Lorem ipsum dolor sit amet",
    "flagged": True,
    "describe": {
        "tokens": {
            "detected": ["Lorem"],
            "count": 1,
            "frequency": {
                "Lorem": 1,
            },
        }
    },
}
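A flagged/describe structure like this could be produced by a simple counting pass, sketched below. This is illustrative only; the package's actual parse method may differ.

```python
def parse(data, tokens):
    """Build a result dict describing which tokens appear in data, and how often."""
    # Count occurrences of each token; keep only tokens that actually appear.
    frequency = {t: data.count(t) for t in tokens if t in data}
    detected = list(frequency)
    return {
        "data": data,
        "flagged": bool(detected),
        "describe": {
            "tokens": {
                "detected": detected,
                "count": len(detected),
                "frequency": frequency,
            }
        },
    }

result = parse("Lorem ipsum dolor sit amet", ["Lorem"])
print(result["flagged"], result["describe"]["tokens"]["frequency"])
# True {'Lorem': 1}
```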
Filters
Filters subclass and extend the Base class to support various data types and structures. This extensibility allows for the creation of powerful custom filters specifically tailored to a given task, data type or structure.
CSV
Parameters
CSV is a subclass of Base and inherits all parameters.
path
type: <str>
Path to a CSV file.
Methods
CSV is a subclass of Base and inherits all methods.
read_csv
Static method that accepts parameter stream of type TextIO and returns a generator that yields each CSV row as a list.
Yields: <list>
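With the standard library's csv module, a read_csv-style helper can be sketched as follows (assumed behavior: each yielded item is one row as a list of strings):

```python
import csv
import io

def read_csv(stream):
    """Yield each row of the CSV stream as a list of strings."""
    yield from csv.reader(stream)

# Any text stream works; here a StringIO stands in for an open file.
rows = read_csv(io.StringIO("a,b,c\n1,2,3\n"))
print(next(rows))  # ['a', 'b', 'c']
print(next(rows))  # ['1', '2', '3']
```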
Text
Parameters
Text is a subclass of Base and inherits all parameters.
text
type: <str>
A text string.
re_split
type: <str>
A regular expression pattern or string that will be applied to text with re.split before normalization.
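For example, the lookbehind pattern used in the TextFile example above splits text into sentences while keeping each trailing period. Shown here with re.split directly:

```python
import re

text = "Lorem ipsum. Dolor sit amet. Consectetur adipiscing elit."
# (?<=\.) matches the zero-width position after each period, so the
# split keeps the period attached to its sentence.
segments = [s.strip() for s in re.split(r"(?<=\.)", text) if s.strip()]
print(segments)
# ['Lorem ipsum.', 'Dolor sit amet.', 'Consectetur adipiscing elit.']
```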
Methods
Text is a subclass of Base and inherits all methods.
TextFile
Parameters
TextFile is a subclass of Base and inherits all parameters.
path
type: <str>
Path to a text file.
re_split
type: <str>
A regular expression pattern or string that will be applied to the file's text with re.split before normalization.
Methods
TextFile is a subclass of Base and inherits all methods.