Tool for searching possibly malformed input data as a preprocessing step for further analysis.

Project description

FindAnywhere is a tool designed for data analysts and developers facing the challenge of extracting meaningful information from poorly structured or malformed CSV files. This tool simplifies the process of filtering and analyzing data by allowing users to prefilter large datasets without needing to correct their format first, focusing efforts on smaller, more relevant subsets.

Example

Suppose we have a malformed CSV file where some parts of the address blend into the email column. Furthermore, the CSV file has escaping issues that generate extra columns.

username,address,email,notes
alice.ashcroft,5th Avenue Ashville,alice.ashcroft@here.local,
bob.bones,Alice Ashcroft Memorial Lane,Ashville Cyan County, California,bob.bones@here.local
charlie.st.claire,charlie.st.claire@here.local,1st street Cleveland,
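To see how the rows drift out of alignment, a minimal sketch with Python's standard csv module can count the cells per row. The SAMPLE constant and the helper name are illustrative and not part of FindAnywhere:

```python
import csv
import io

# The sample file from above, embedded as a string for illustration.
SAMPLE = """\
username,address,email,notes
alice.ashcroft,5th Avenue Ashville,alice.ashcroft@here.local,
bob.bones,Alice Ashcroft Memorial Lane,Ashville Cyan County, California,bob.bones@here.local
charlie.st.claire,charlie.st.claire@here.local,1st street Cleveland,
"""


def cells_per_row(text):
    """Return the header width and the number of cells in each data row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return len(header), [len(row) for row in body]


if __name__ == "__main__":
    width, counts = cells_per_row(SAMPLE)
    for line, count in enumerate(counts):
        status = "ok" if count == width else "misaligned"
        print(f"line {line}: {count} cells ({status})")
```

Only the second data row has the wrong number of cells. The third row has the expected width, yet its email sits in the address column, so a width check alone would not catch it.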

We want to search the CSV file for email addresses and some town information, but cannot rely on the data being present where it should be. The information we seek is provided as a JSON file:

[
  {"id": "alice", "email": "alice.ashcroft@here.local", "town":  "Ashville"},
  {"id": "charlie", "email":  "charlie.st.claire@here.local"}
]

After running findanywhere on the datasets, we get the following results in JSON Lines format. The result file holds the records that may be worth analyzing further, so the original CSV file does not have to be fixed first. This helps most when problems are hard to locate in larger datasets. In this output, line appears to count data rows from zero, excluding the header.

{"of": "alice", "best_matches": {"email": {"position": {"line": 0, "column": "email"}, "value": "alice.ashcroft@here.local", "similarity": 1.0}, "id": {"position": {"line": 0, "column": "username"}, "value": "alice.ashcroft", "similarity": 0.8714285714285714}, "town": {"position": {"line": 0, "column": "address"}, "value": "5th Avenue Ashville", "similarity": 1.0}}, "score": 0.9571428571428572}
{"of": "charlie", "best_matches": {"email": {"position": {"line": 2, "column": "address"}, "value": "charlie.st.claire@here.local", "similarity": 1.0}, "id": {"position": {"line": 2, "column": "username"}, "value": "charlie.st.claire", "similarity": 0.8823529411764706}}, "score": 0.9411764705882353}
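In the sample output above, each score equals the arithmetic mean of the similarity values in best_matches. The following sketch recomputes it for abbreviated copies of the two result lines. It is a check on the numbers shown here, not a description of the tool's internals, and the helper name is illustrative:

```python
import json
from statistics import fmean

# The two result lines from above, reduced to the fields needed here.
RESULT_LINES = [
    '{"of": "alice", "best_matches": {"email": {"similarity": 1.0}, '
    '"id": {"similarity": 0.8714285714285714}, "town": {"similarity": 1.0}}, '
    '"score": 0.9571428571428572}',
    '{"of": "charlie", "best_matches": {"email": {"similarity": 1.0}, '
    '"id": {"similarity": 0.8823529411764706}}, "score": 0.9411764705882353}',
]


def mean_similarity(record):
    """Average the per-field similarities of one parsed result record."""
    return fmean(match["similarity"] for match in record["best_matches"].values())


if __name__ == "__main__":
    for line in RESULT_LINES:
        record = json.loads(line)
        print(record["of"], record["score"], mean_similarity(record))
```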

Alternatively, a direct search can be issued with the following command:

findanywhere_search search_data.json input.csv \
--source tabular --threshold constant \
--threshold-constant 0.8 --similarity jaro_winkler
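The id similarities in the sample output (about 0.871 for alice against alice.ashcroft) agree with the textbook Jaro-Winkler formula. Below is a minimal pure-Python sketch of that formula, assuming the common defaults of a prefix scale of 0.1 and a maximum prefix length of 4. It illustrates the metric and is not FindAnywhere's own implementation:

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (assumed default parameters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    print(jaro_winkler("alice", "alice.ashcroft"))
    print(jaro_winkler("charlie", "charlie.st.claire"))
```

The prefix boost is why a short identifier such as alice still scores well against the much longer alice.ashcroft, which matters when choosing a constant threshold such as 0.8.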

Usage

Start by creating a schema to define the parameters for searching through your data:

findanywhere_schema tabular string_based_evaluation \
--threshold constant \
--out schema.yml

Edit schema.yml as needed; the documentation describes the available options and methods. An example schema:

deduction:
  config: {}
  name: average
evaluation:
  config:
    aggregate: max
    similarity: token_best_fit_similarity
    similarity_parameter: {}
  name: string_based_evaluation
source:
  config:
    encoding: utf-8
    errors: surrogateescape
  name: tabular
threshold:
  config:
    constant: 0.9
  name: constant
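The option names token_best_fit_similarity and aggregate: max suggest comparing a search value against the individual tokens of a cell and keeping the best score, which would explain why Ashville matches the cell 5th Avenue Ashville with similarity 1.0 in the example. The sketch below illustrates that idea only. It is an assumption based on the option names, not the tool's documented behaviour, and it uses difflib.SequenceMatcher from the standard library as a stand-in similarity:

```python
from difflib import SequenceMatcher


def token_similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT the metric FindAnywhere uses."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_token_match(needle, cell):
    """Compare needle with each whitespace-separated token and keep the maximum."""
    tokens = cell.split()
    if not tokens:
        return 0.0
    return max(token_similarity(needle, token) for token in tokens)


if __name__ == "__main__":
    print(best_token_match("Ashville", "5th Avenue Ashville"))
    print(best_token_match("Ashville", "1st street Cleveland"))
```

Under this reading, a constant threshold of 0.9 would keep the first cell and discard the second.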

Run the tool against your datasets using the defined schema:

findanywhere schema.yml search_data.json garbage.csv --out result.json_line

Results will be stored in result.json_line. For additional commands and options, use the --help flag.
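Because the result file is in JSON Lines format, each line can be parsed on its own. Here is a minimal post-processing sketch, assuming the record layout shown in the example output (of, best_matches, score). The function names and the min_score parameter are illustrative and not part of FindAnywhere:

```python
import json


def load_results(path, min_score=0.0):
    """Read a JSON Lines result file; return records sorted by score, best first."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            raw = raw.strip()
            if not raw:
                continue  # tolerate blank lines
            record = json.loads(raw)
            if record["score"] >= min_score:
                records.append(record)
    return sorted(records, key=lambda record: record["score"], reverse=True)


def match_locations(record):
    """Map each searched field to the (line, column) of its best match."""
    return {
        field: (match["position"]["line"], match["position"]["column"])
        for field, match in record["best_matches"].items()
    }


# Example use, once result.json_line exists:
#     for record in load_results("result.json_line", min_score=0.95):
#         print(record["of"], match_locations(record))
```

Listing the locations per field makes it easy to spot values that turned up in an unexpected column, such as an email found under address.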

Installation

Install FindAnywhere using pip:

pip install findanywhere

Key Features

  • Robust Malformed File Handling: Efficiently processes CSV files with irregular column structures or misplaced data entries.
  • Fuzzy Matching Capabilities: Matches data points by string similarity (for example Jaro-Winkler), tolerating misspelled, partial, or misplaced values.
  • Parallel Processing Support: Leverages multiple processes to enhance performance on large datasets.

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

findanywhere-1.6.3-py3-none-any.whl (45.5 kB)

Uploaded Python 3

File details

Details for the file findanywhere-1.6.3-py3-none-any.whl.

File metadata

  • Download URL: findanywhere-1.6.3-py3-none-any.whl
  • Upload date:
  • Size: 45.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.12.7 Linux/5.15.154+

File hashes

Hashes for findanywhere-1.6.3-py3-none-any.whl
Algorithm Hash digest
SHA256 9119c1b3a6ae3ed84cbe55633634c452f56086a4229a00c33e44839daedce9bd
MD5 37c01046a81dbfd5bab215255eb76e58
BLAKE2b-256 211c6326ef571b3082cdf3c0ed1905701d6f7ba67cb8367f7eec9e5752807cd3