Tool for searching possibly malformed input data as a preprocessing step for further analysis.

Project description

FindAnywhere is a tool designed for data analysts and developers facing the challenge of extracting meaningful information from poorly structured or malformed CSV files. This tool simplifies the process of filtering and analyzing data by allowing users to prefilter large datasets without needing to correct their format first, focusing efforts on smaller, more relevant subsets.

Example

Suppose we have a malformed CSV file where parts of the address spill into the email column. The file also has escaping issues (unquoted commas inside values), which generate extra columns.

username,address,email,notes
alice.ashcroft,5th Avenue Ashville,alice.ashcroft@here.local,
bob.bones,Alice Ashcroft Memorial Lane,Ashville Cyan County, California,bob.bones@here.local
charlie.st.claire,charlie.st.claire@here.local,1st street Cleveland,
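To see why this file is hard to clean with a simple check, the following minimal sketch parses the sample with Python's standard csv module and counts the fields per row. It is not part of FindAnywhere; the helper name field_counts is made up for this illustration.

```python
import csv
import io

# The sample file from above, verbatim.
SAMPLE = """username,address,email,notes
alice.ashcroft,5th Avenue Ashville,alice.ashcroft@here.local,
bob.bones,Alice Ashcroft Memorial Lane,Ashville Cyan County, California,bob.bones@here.local
charlie.st.claire,charlie.st.claire@here.local,1st street Cleveland,
"""


def field_counts(text):
    """Return the header length and the number of fields in each data row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return len(header), [len(row) for row in reader]


header_len, counts = field_counts(SAMPLE)
for line_no, count in enumerate(counts):
    status = "ok" if count == header_len else "unexpected field count"
    print(line_no, count, status)
```

Only the second data row is flagged, because its unquoted commas produce a fifth field. The third row has the expected number of fields even though its email sits in the address column, so a field count alone would not catch it.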

We want to search the CSV file for email addresses and some town information, but cannot rely on the data being where it should be. The information we seek is provided as a JSON file:

[
  {"id": "alice", "email": "alice.ashcroft@here.local", "town":  "Ashville"},
  {"id": "charlie", "email":  "charlie.st.claire@here.local"}
]

After running findanywhere on the datasets, we get the following results in JSON Lines format. The result file holds the records that may be worth analyzing further, so the original CSV file does not have to be fixed first. This helps most with larger datasets, where problems can be hard to find.

{"of": "alice", "best_matches": {"email": {"position": {"line": 0, "column": "email"}, "value": "alice.ashcroft@here.local", "similarity": 1.0}, "id": {"position": {"line": 0, "column": "username"}, "value": "alice.ashcroft", "similarity": 0.8714285714285714}, "town": {"position": {"line": 0, "column": "address"}, "value": "5th Avenue Ashville", "similarity": 1.0}}, "score": 0.9571428571428572}
{"of": "charlie", "best_matches": {"email": {"position": {"line": 2, "column": "address"}, "value": "charlie.st.claire@here.local", "similarity": 1.0}, "id": {"position": {"line": 2, "column": "username"}, "value": "charlie.st.claire", "similarity": 0.8823529411764706}}, "score": 0.9411764705882353}
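As a hedged illustration of how such a result file might be post-processed, the sketch below parses the two result lines shown above and reports which search fields were found outside the column where one would expect them. The EXPECTED_COLUMN mapping and the misplaced helper are assumptions made for this example, not part of the tool.

```python
import json

# The two result lines shown above, verbatim (split only for readability).
ALICE = (
    '{"of": "alice", "best_matches": {'
    '"email": {"position": {"line": 0, "column": "email"}, '
    '"value": "alice.ashcroft@here.local", "similarity": 1.0}, '
    '"id": {"position": {"line": 0, "column": "username"}, '
    '"value": "alice.ashcroft", "similarity": 0.8714285714285714}, '
    '"town": {"position": {"line": 0, "column": "address"}, '
    '"value": "5th Avenue Ashville", "similarity": 1.0}}, '
    '"score": 0.9571428571428572}'
)
CHARLIE = (
    '{"of": "charlie", "best_matches": {'
    '"email": {"position": {"line": 2, "column": "address"}, '
    '"value": "charlie.st.claire@here.local", "similarity": 1.0}, '
    '"id": {"position": {"line": 2, "column": "username"}, '
    '"value": "charlie.st.claire", "similarity": 0.8823529411764706}}, '
    '"score": 0.9411764705882353}'
)
RESULT_LINES = [ALICE, CHARLIE]

# Assumption for this example: the column each search field should come from.
EXPECTED_COLUMN = {"email": "email", "id": "username", "town": "address"}


def misplaced(record):
    """List the search fields whose best match sits in an unexpected column."""
    return sorted(
        field
        for field, match in record["best_matches"].items()
        if match["position"]["column"] != EXPECTED_COLUMN.get(field)
    )


for raw in RESULT_LINES:
    record = json.loads(raw)
    print(record["of"], record["score"], misplaced(record))
```

For the sample results, only the record for charlie reports a misplaced field: the email was found in the address column.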

Alternatively, a direct search can be issued without a schema file by using the command:

findanywhere_search search_data.json input.csv \
--source tabular --threshold constant \
--threshold-constant 0.8 --similarity jaro_winkler
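For readers unfamiliar with the jaro_winkler option, here is a minimal textbook sketch of the Jaro-Winkler similarity in plain Python. Whether FindAnywhere computes it exactly this way is an assumption; the function names and the default parameters (prefix weight 0.1, prefix length capped at 4) are the standard ones, not taken from the tool.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity: shared characters within a sliding window."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    a_seq = [ch for ch, hit in zip(a, a_hit) if hit]
    b_seq = [ch for ch, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (
        matches / len(a)
        + matches / len(b)
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Boost the Jaro similarity for strings sharing a common prefix."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1.0 - base)


print(jaro_winkler("alice", "alice.ashcroft"))
print(jaro_winkler("charlie", "charlie.st.claire"))
```

With these standard parameters, the two calls give values that agree closely with the id similarities in the example results above, which suggests (but does not prove) that the tool uses a comparable definition.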

Usage

Start by creating a schema to define the parameters for searching through your data:

findanywhere_schema tabular string_based_evaluation \
--threshold constant \
--out schema.yml

Edit the schema.yml file as needed, using the documentation to guide the configuration of options and methods.

deduction:
  config: {}
  name: average
evaluation:
  config:
    aggregate: max
    similarity: token_best_fit_similarity
    similarity_parameter: {}
  name: string_based_evaluation
source:
  config:
    encoding: utf-8
    errors: surrogateescape
  name: tabular
threshold:
  config:
    constant: 0.9
  name: constant
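The schema above names three stages: an evaluation (token_best_fit_similarity with aggregate max), a deduction (average) and a threshold (constant). The following sketch shows one plausible reading of how these stages fit together. It is an assumption, not the tool's implementation: difflib's SequenceMatcher stands in for the configured similarity, tokens are split on whitespace, and all function names are invented for this example.

```python
from difflib import SequenceMatcher
from statistics import mean


def token_best_fit(term: str, value: str) -> float:
    """Compare the term with each whitespace token of value; keep the max."""
    tokens = value.split() or [value]
    return max(SequenceMatcher(None, term, tok).ratio() for tok in tokens)


def score(search_item: dict, row: dict) -> float:
    """Average, over all search fields, of the best similarity in any column."""
    best = [
        max(token_best_fit(str(term), str(cell)) for cell in row.values())
        for term in search_item.values()
    ]
    return mean(best)


def passes(score_value: float, constant: float = 0.9) -> bool:
    """Constant threshold (assumed here to be an inclusive lower bound)."""
    return score_value >= constant


row = {
    "username": "alice.ashcroft",
    "address": "5th Avenue Ashville",
    "email": "alice.ashcroft@here.local",
    "notes": "",
}
item = {"email": "alice.ashcroft@here.local", "town": "Ashville"}

row_score = score(item, row)
print(row_score, passes(row_score))
```

Because every column is searched for every field, the town is found inside the address value and the email would still be found if it sat in the wrong column, which mirrors the behavior described in the example section.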

Run the tool against your datasets using the defined schema:

findanywhere schema.yml search_data.json garbage.csv --out result.json_line

Results will be stored in result.json_line. For additional commands and options, use the --help flag.

Installation

Install FindAnywhere using pip:

pip install findanywhere

Key Features

  • Robust Malformed File Handling: Efficiently processes CSV files with irregular column structures or misplaced data entries.
  • Fuzzy Matching Capabilities: Matches data points by string similarity (for example Jaro-Winkler or token-based best fit), so values are found even when they are misspelled, embedded in longer text, or stored in the wrong column.
  • Parallel Processing Support: Leverages multiple processes to enhance performance on large datasets.

