Tool for searching possibly malformed input data as a preprocessing step for further analysis.
Project description
FindAnywhere is a tool for data analysts and developers who need to extract meaningful information from poorly structured or malformed CSV files. It lets users prefilter large datasets without correcting their format first, so that further effort can focus on smaller, more relevant subsets.
Example
Suppose we have a malformed CSV file where some parts of the address blend into the email column. The CSV file also has escaping issues that produce extra columns.
username,address,email,notes
alice.ashcroft,5th Avenue Ashville,alice.ashcroft@here.local,
bob.bones,Alice Ashcroft Memorial Lane,Ashville Cyan County, California,bob.bones@here.local
charlie.st.claire,charlie.st.claire@here.local,1st street Cleveland,
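To make the problems in this sample concrete, the following sketch parses it with Python's standard csv module and reports how many fields each line yields and where a value containing "@" sits outside the email column. It is only an illustration of the input, not part of findanywhere; the helper names field_counts and misplaced_emails are hypothetical.

```python
import csv
import io

# The sample file from above, embedded as a string so the sketch is self-contained.
SAMPLE = """\
username,address,email,notes
alice.ashcroft,5th Avenue Ashville,alice.ashcroft@here.local,
bob.bones,Alice Ashcroft Memorial Lane,Ashville Cyan County, California,bob.bones@here.local
charlie.st.claire,charlie.st.claire@here.local,1st street Cleveland,
"""


def field_counts(text: str) -> list[int]:
    """Return the number of fields csv.reader yields for each line (header included)."""
    return [len(row) for row in csv.reader(io.StringIO(text))]


def misplaced_emails(text: str) -> list[tuple[int, int]]:
    """Return (data line index, column index) for '@' values outside the email column."""
    rows = list(csv.reader(io.StringIO(text)))
    email_index = rows[0].index("email")
    return [
        (line, col)
        for line, row in enumerate(rows[1:])  # data lines are counted from 0
        for col, value in enumerate(row)
        if "@" in value and col != email_index
    ]


print(field_counts(SAMPLE))      # [4, 4, 5, 4]
print(misplaced_emails(SAMPLE))  # [(1, 4), (2, 1)]
```

The header declares four columns, but the unescaped comma in bob.bones' address produces a fifth field, and charlie.st.claire's email address sits in the address column.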
We want to search the CSV file for email addresses and some town information, but cannot rely on the data being where it should be. The information we seek is provided as a JSON file:
[
{"id": "alice", "email": "alice.ashcroft@here.local", "town": "Ashville"},
{"id": "charlie", "email": "charlie.st.claire@here.local"}
]
After running findanywhere on the datasets, we get the following results in JSON Lines format. The result file holds the records that may be worth analyzing further, without our having to fix the original CSV file first, which helps when problems are hard to find in larger datasets.
{"of": "alice", "best_matches": {"email": {"position": {"line": 0, "column": "email"}, "value": "alice.ashcroft@here.local", "similarity": 1.0}, "id": {"position": {"line": 0, "column": "username"}, "value": "alice.ashcroft", "similarity": 0.8714285714285714}, "town": {"position": {"line": 0, "column": "address"}, "value": "5th Avenue Ashville", "similarity": 1.0}}, "score": 0.9571428571428572}
{"of": "charlie", "best_matches": {"email": {"position": {"line": 2, "column": "address"}, "value": "charlie.st.claire@here.local", "similarity": 1.0}, "id": {"position": {"line": 2, "column": "username"}, "value": "charlie.st.claire", "similarity": 0.8823529411764706}}, "score": 0.9411764705882353}
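A result file like this can be post-processed with a few lines of Python. The sketch below embeds the two result lines shown above, lists where each searched field was found, and compares each record's score with the arithmetic mean of its similarities. In this sample the two agree, which is consistent with a deduction named average; that is an observation on this sample, not a statement about findanywhere's internals. The helper names are hypothetical.

```python
import json
import math
from statistics import mean

# The two result lines from above, one JSON document per line.
RESULT_LINES = [
    '{"of": "alice", "best_matches": {"email": {"position": {"line": 0, "column": "email"}, "value": "alice.ashcroft@here.local", "similarity": 1.0}, "id": {"position": {"line": 0, "column": "username"}, "value": "alice.ashcroft", "similarity": 0.8714285714285714}, "town": {"position": {"line": 0, "column": "address"}, "value": "5th Avenue Ashville", "similarity": 1.0}}, "score": 0.9571428571428572}',
    '{"of": "charlie", "best_matches": {"email": {"position": {"line": 2, "column": "address"}, "value": "charlie.st.claire@here.local", "similarity": 1.0}, "id": {"position": {"line": 2, "column": "username"}, "value": "charlie.st.claire", "similarity": 0.8823529411764706}}, "score": 0.9411764705882353}',
]


def load_results(lines):
    """Parse JSON Lines input, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def found_in(result):
    """Map each searched field to the (line, column) of its best match."""
    return {
        field: (match["position"]["line"], match["position"]["column"])
        for field, match in result["best_matches"].items()
    }


def mean_similarity(result):
    """Arithmetic mean of the similarities of all best matches."""
    return mean(match["similarity"] for match in result["best_matches"].values())


for result in load_results(RESULT_LINES):
    agrees = math.isclose(mean_similarity(result), result["score"])
    print(result["of"], found_in(result), "score equals mean:", agrees)
```

For a real run, the same functions can be fed from the file directly, for example with open("result.json_line") as the source of lines.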
Alternatively, a direct search can be issued with the following command:
findanywhere_search search_data.json input.csv \
--source tabular --threshold constant \
--threshold-constant 0.8 --similarity jaro_winkler
Usage
Start by creating a schema to define the parameters for searching through your data:
findanywhere_schema tabular string_based_evaluation \
--threshold constant \
--out schema.yml
Edit the schema.yml file as needed, using the documentation to guide the configuration of options and methods.
deduction:
  config: {}
  name: average
evaluation:
  config:
    aggregate: max
    similarity: token_best_fit_similarity
    similarity_parameter: {}
  name: string_based_evaluation
source:
  config:
    encoding: utf-8
    errors: surrogateescape
  name: tabular
threshold:
  config:
    constant: 0.9
  name: constant
Run the tool against your datasets using the defined schema:
findanywhere schema.yml search_data.json garbage.csv --out result.json_line
Results will be stored in result.json_line. For additional commands and options, use the --help flag.
Installation
Install FindAnywhere using pip:
pip install findanywhere
Key Features
- Robust Malformed File Handling: Efficiently processes CSV files with irregular column structures or misplaced data entries.
- Fuzzy Matching Capabilities: Matches data points by string similarity (for example jaro_winkler or token_best_fit_similarity, as used above), accommodating various types of data discrepancies.
- Parallel Processing Support: Leverages multiple processes to enhance performance on large datasets.
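To illustrate the idea behind fuzzy matching on misplaced values, here is a minimal sketch of a token-based "best fit" comparison: the cell is split into tokens, each token is compared with the search term, and the best ratio wins. It uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure. This is an assumption made for illustration only; findanywhere's own token_best_fit_similarity and jaro_winkler implementations may work differently, and the function names below are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (stand-in measure, not findanywhere's)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_best_fit(needle: str, haystack: str) -> float:
    """Best similarity between the needle and any whitespace-separated token."""
    tokens = haystack.split()
    if not tokens:
        return 0.0
    return max(ratio(needle, token) for token in tokens)


def best_cell(needle: str, row: list[str]) -> tuple[int, str, float]:
    """Return (column index, value, similarity) of the best matching cell in a row."""
    scored = [(index, value, token_best_fit(needle, value)) for index, value in enumerate(row)]
    return max(scored, key=lambda item: item[2])


# The fields of the bob.bones line from the example, as split on commas.
row = [
    "bob.bones",
    "Alice Ashcroft Memorial Lane",
    "Ashville Cyan County",
    " California",
    "bob.bones@here.local",
]

print(token_best_fit("Ashville", "5th Avenue Ashville"))  # 1.0, one token matches exactly
print(best_cell("Ashville", row))
```

Comparing per token rather than against the whole cell is what lets a short search term such as a town name score highly inside a longer address string.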