Project description

WhereIsIt is a tool for data analysts and developers who need to extract meaningful information from poorly structured or malformed CSV files. It lets you prefilter large datasets without correcting their format first, so that cleanup effort can focus on smaller, more relevant subsets.

Example

Suppose we have a malformed CSV file in which parts of the address spill into the email column. The file also has escaping issues that produce extra columns.

username,address,email,notes
alice.ashcroft,5th Avenue Ashville,alice.ashcroft@here.local,
bob.bones,Alice Ashcroft Memorial Lane,Ashville Cyan County, California,bob.bones@here.local
charlie.st.claire,charlie.st.claire@here.local,1st street Cleveland,

We want to search the CSV file for email addresses and some town information, but cannot rely on the data being in the column where it belongs. The values we are looking for are provided as a JSON file:

[
  {"id": "alice", "email": "alice.ashcroft@here.local", "town":  "Ashville"},
  {"id": "charlie", "email":  "charlie.st.claire@here.local"}
]
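If the values to look for already live in Python, a search file of this shape can be written with the standard library. This is a minimal sketch: the helper name write_search_data is hypothetical, and dropping fields whose value is None is an assumption based on the second record above, which simply omits "town".

```python
import json
from pathlib import Path


def write_search_data(records, path):
    """Write records as a JSON list, leaving out fields that have no value."""
    # Assumption: unknown fields are omitted entirely, as in the example file.
    cleaned = [{k: v for k, v in rec.items() if v is not None} for rec in records]
    Path(path).write_text(json.dumps(cleaned, indent=2), encoding="utf-8")
    return cleaned


records = [
    {"id": "alice", "email": "alice.ashcroft@here.local", "town": "Ashville"},
    {"id": "charlie", "email": "charlie.st.claire@here.local", "town": None},
]

# search_data.json is the file name used in the Usage section.
write_search_data(records, "search_data.json")
```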

After running WhereIsIt on both files, we get the following results in JSON Lines format. The result file holds the records that may be worth analyzing further, without having to fix the original CSV file first. This is especially useful when problems are hard to find in larger datasets.

{"of": "alice", "best_matches": {"email": {"position": {"line": 0, "column": "email"}, "value": "alice.ashcroft@here.local", "similarity": 1.0}, "id": {"position": {"line": 0, "column": "username"}, "value": "alice.ashcroft", "similarity": 0.8714285714285714}, "town": {"position": {"line": 0, "column": "address"}, "value": "5th Avenue Ashville", "similarity": 1.0}}, "score": 0.9571428571428572}
{"of": "charlie", "best_matches": {"email": {"position": {"line": 2, "column": "address"}, "value": "charlie.st.claire@here.local", "similarity": 1.0}, "id": {"position": {"line": 2, "column": "username"}, "value": "charlie.st.claire", "similarity": 0.8823529411764706}}, "score": 0.9411764705882353}
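The result lines can be read back with a few lines of Python. The sketch below parses the two lines shown above, prints where each field was found, and checks that each "score" equals the arithmetic mean of the per-field similarities. That reading of the score is an inference from these two lines, not a documented guarantee, and the helper name mean_similarity is hypothetical.

```python
import json
from statistics import fmean

# The two result lines from the example, copied verbatim.
RESULT_LINES = """\
{"of": "alice", "best_matches": {"email": {"position": {"line": 0, "column": "email"}, "value": "alice.ashcroft@here.local", "similarity": 1.0}, "id": {"position": {"line": 0, "column": "username"}, "value": "alice.ashcroft", "similarity": 0.8714285714285714}, "town": {"position": {"line": 0, "column": "address"}, "value": "5th Avenue Ashville", "similarity": 1.0}}, "score": 0.9571428571428572}
{"of": "charlie", "best_matches": {"email": {"position": {"line": 2, "column": "address"}, "value": "charlie.st.claire@here.local", "similarity": 1.0}, "id": {"position": {"line": 2, "column": "username"}, "value": "charlie.st.claire", "similarity": 0.8823529411764706}}, "score": 0.9411764705882353}
"""


def mean_similarity(result):
    """Arithmetic mean of the similarity of every best match in one result."""
    return fmean([m["similarity"] for m in result["best_matches"].values()])


# JSON Lines: one independent JSON document per non-empty line.
results = [json.loads(line) for line in RESULT_LINES.splitlines() if line.strip()]

for result in results:
    print(result["of"], "score:", result["score"])
    for field, match in result["best_matches"].items():
        pos = match["position"]
        print(f"  {field}: line {pos['line']}, column {pos['column']!r}")
```

Note that the email for "charlie" is reported in the address column, which is exactly the kind of misplaced value the search is meant to surface.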

Usage

Start by creating a schema to define the parameters for searching through your data:

whereisit_schema tabular string_based_evaluation \
--threshold constant \
--out schema.yml

Edit the schema.yml file as needed; the documentation describes the available options and methods.

deduction:
  config: {}
  name: average
evaluation:
  config:
    aggregate: max
    similarity: token_best_fit_similarity
    similarity_parameter: {}
  name: string_based_evaluation
source:
  config:
    encoding: utf-8
    errors: surrogateescape
  name: tabular
threshold:
  config:
    constant: 0.9
  name: constant
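The schema names token_best_fit_similarity with aggregate: max but does not spell out the computation. As a rough illustration only, the sketch below splits a cell on whitespace, scores each token with a hand-written Jaro-Winkler similarity, and keeps the best token. The functions jaro, jaro_winkler and token_best_fit are hypothetical and not part of the package. Choosing Jaro-Winkler is an assumption, made because it is consistent with the similarity values in the example output.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity between two strings, in the range [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    a_seq = [ch for ch, hit in zip(a, a_hit) if hit]
    b_seq = [ch for ch, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (
        matches / len(a) + matches / len(b) + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def token_best_fit(needle: str, cell: str) -> float:
    """Best similarity between the needle and any whitespace-separated token."""
    tokens = cell.split() or [cell]
    return max(jaro_winkler(needle, token) for token in tokens)


for needle, cell in [
    ("alice", "alice.ashcroft"),
    ("Ashville", "5th Avenue Ashville"),
    ("charlie", "charlie.st.claire"),
]:
    print(needle, "->", cell, token_best_fit(needle, cell))
```

Under these assumptions, a search value that appears as one whole token inside a longer cell scores 1.0, which is why "Ashville" can be found inside a full address.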

Run the tool against your datasets using the defined schema:

findanywhere schema.yml search_data.json garbage.csv --out result.json_line

Results will be stored in result.json_line. For additional commands and options, use the --help flag.

Installation

Install WhereIsIt with pip; the package is published as findanywhere:

pip install findanywhere

Key Features

  • Robust Malformed File Handling: Processes CSV files with irregular column counts or values in the wrong column, without requiring the file to be repaired first.
  • Fuzzy Matching: Matches values by string similarity rather than exact equality, so near matches such as "alice" and "alice.ashcroft" are still found.
  • Parallel Processing Support: Uses multiple processes to speed up searches on large datasets.

Download files

Source Distribution

findanywhere-1.0.0.tar.gz (46.6 kB)

Built Distribution

findanywhere-1.0.0-py3-none-any.whl (28.8 kB, Python 3)
