
Python package for probabilistic delimiter detection.


DelimiterFinder


DelimiterFinder is a Python package for probabilistic delimiter detection. It is a fast, efficient, and easy-to-use tool for identifying unknown delimiters within tabular data.

Key Features

  • Versatile: Detects both single-character and multi-character delimiters.

  • Versatile: Supports tabular data stored in common file formats (e.g., CSV, TSV, TXT) or held in Python strings and lists.

  • Robustness: Leverages Bayesian techniques to probabilistically identify unknown delimiters given data.

  • Robustness: Includes significance testing for all results.

  • Robustness: Robust to malformed data (not an "all or nothing approach" in the case of malformed rows).

  • Transparency: Reports posterior probabilities for all identified candidate delimiters.

  • Fast and efficient: Detects delimiters with a high level of confidence from as few as 10 rows.

Installation

Install the latest released version from PyPI.

pip install DelimiterFinder

User Guide

Parameters and methods for DelimiterFinder.finder.Finder

class DelimiterFinder.finder.Finder(ignore_chars=None)
| Parameter | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| ignore_chars | list | None | Yes | List of non-alphanumeric characters which should not be considered candidate delimiters. |

| Attribute | Type | Description |
| --- | --- | --- |
| posterior | dict | The posterior probability of each candidate delimiter. |
| bayes_factor | float | Evidence in favor of the most likely delimiter (MAP) relative to the second most likely delimiter. |

Methods:

find(data, is_path=False, num_samples=20, new_line_sep="\n")
| Parameter | Type | Default | Optional | Description |
| --- | --- | --- | --- | --- |
| data | str or list | | No | The input data, either as a single string with each row separated by new_line_sep or as a list where each element is a row. Alternatively, a path to a text file (e.g., .TXT, .CSV) may be passed, in which case the is_path parameter should be set to True. |
| is_path | bool | False | Yes | An indicator for whether the value passed to the data parameter is a file path. |
| num_samples | int | 20 | Yes | Number of rows to sample for inference. |
| new_line_sep | str | "\n" | Yes | The new line separator for the rows in the data. |

| Return | Type | Description |
| --- | --- | --- |
| delim | str | The maximum a posteriori probability (MAP) estimate. |
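The three input forms accepted by find (a single string, a list of rows, or a file path) can be pictured as being normalized into a header plus a sample of rows before inference. The sketch below is illustrative only: load_rows and its seed argument are hypothetical names, and the uniform random sampling and the dropping of empty rows are assumptions, not the package's documented behavior.

```python
import random
from pathlib import Path


def load_rows(data, is_path=False, num_samples=20, new_line_sep="\n", seed=0):
    """Hypothetical helper: normalize input into (header, sampled_rows)."""
    if is_path:
        # Treat `data` as a path to a text file and read it whole.
        data = Path(data).read_text()
    if isinstance(data, str):
        # A single string is split into rows on the separator.
        rows = data.split(new_line_sep)
    else:
        # Any other iterable is taken to hold one row per element.
        rows = list(data)
    # Assumption: empty rows (e.g. from a trailing newline) are dropped.
    rows = [row for row in rows if row]
    if not rows:
        raise ValueError("no rows found in data")
    header, body = rows[0], rows[1:]
    if len(body) > num_samples:
        # Assumption: rows are sampled uniformly without replacement.
        body = random.Random(seed).sample(body, num_samples)
    return header, body


header, body = load_rows("a,b\n1,2\n3,4\n")
print(header)  # a,b
print(body)    # ['1,2', '3,4']
```

The header is kept out of the sample because the method described under Bayesian Methods treats the first row specially.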

Example

Using DelimiterFinder is easy. To get started, simply create an instance of the Finder class and pass your data to the find method. The example below walks through a simple implementation.

>>> from DelimiterFinder.finder import Finder
>>> # example data
>>> data = "c_1~|~c_2~|~c_3\n1~|~2~|~3\n4~|~~|~\n5~|~~|~6"
>>> # create instance of Finder and fit to data
>>> delim_locator = Finder()
>>> delim = delim_locator.find(data)
>>> # check the most likely delimiter
>>> print(delim)
~|~
>>> # check the probabilities for each delimiter
>>> print(delim_locator.posterior)
{'_': 0.022, '~|~': 0.977}
>>> # check the results of the significance test
>>> print(delim_locator.bayes_factor)
42.66

As the output above shows, DelimiterFinder identified an unknown three-character delimiter. The posterior attribute provides a dictionary of all tested candidate delimiters and their associated posterior probabilities. The bayes_factor attribute shows that there is strong evidence (i.e., a value greater than 10) in favor of the most likely delimiter relative to the second most likely delimiter. All with just 4 rows of data!

Indeed, DelimiterFinder can handle much more complicated data than the example given above, with the confidence in the decision made increasing with the number of rows provided. The DelimiterFinder has been tested for robustness against hundreds of randomly generated test cases. These tests can be found in the tests directory of the GitHub repo.

Bayesian Methods

Inference

DelimiterFinder leverages Bayesian techniques to probabilistically identify unknown delimiters given data. In particular, DelimiterFinder fits a model using sequential Bayesian updating.

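The model is given as follows:

$$P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{\sum_{\theta'} P(X \mid \theta')\, P(\theta')}$$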

Here, theta ranges over a finite set of candidate delimiters. The candidates are all contiguous strings of valid non-alphanumeric characters (i.e., characters not in the given ignore_chars list) found in the first row of data, which is assumed to be the header. The prior for these candidate delimiters is given by their relative frequencies. The variable X represents a row of data. The likelihood is the ratio between the number of columns in the header and the number of columns in the given row of data, assuming delimiter theta is the true delimiter. Since this is a discrete distribution over a finite number of candidate delimiters, the denominator (the normalization constant) is the sum over all thetas of the likelihood times the prior.
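The description above can be sketched in a few lines of Python. This is a minimal illustration, not the package's source: candidate_prior, likelihood, and update are hypothetical names, and the likelihood is assumed to be the smaller column count divided by the larger, which is one reading of "proportion" that keeps the value between 0 and 1.

```python
from collections import Counter


def candidate_prior(header, ignore_chars=None):
    """Candidates are runs of non-alphanumeric, non-ignored characters."""
    ignore = set(ignore_chars or [])
    runs, current = [], ""
    for ch in header:
        if not ch.isalnum() and ch not in ignore:
            current += ch  # extend the current run
        else:
            if current:
                runs.append(current)
            current = ""  # alphanumeric or ignored characters end a run
    if current:
        runs.append(current)
    counts = Counter(runs)
    total = sum(counts.values())
    # Prior: relative frequency of each candidate among the runs.
    return {cand: n / total for cand, n in counts.items()}


def likelihood(header, row, delim):
    """Assumed likelihood: smaller column count over larger column count."""
    n_header = len(header.split(delim))
    n_row = len(row.split(delim))
    return min(n_header, n_row) / max(n_header, n_row)


def update(prior, header, row):
    """One application of Bayes' rule over the finite candidate set."""
    unnormalized = {d: likelihood(header, row, d) * p for d, p in prior.items()}
    evidence = sum(unnormalized.values())  # the normalization constant
    return {d: v / evidence for d, v in unnormalized.items()}


header = "c_1~|~c_2~|~c_3"
prior = candidate_prior(header)
print(prior)  # {'_': 0.6, '~|~': 0.4}
posterior = update(prior, header, "1~|~2~|~3")
print(max(posterior, key=posterior.get))  # ~|~
```

In the header from the earlier example, "_" occurs three times and "~|~" twice, so "_" starts with the higher prior; a single data row is enough to reverse the ranking under the assumed likelihood.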

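The model is updated sequentially over M rows of data as follows:

$$P(\theta \mid X_1, \ldots, X_{N+1}) = \frac{P(X_{N+1} \mid \theta)\, P(\theta \mid X_1, \ldots, X_N)}{\sum_{\theta'} P(X_{N+1} \mid \theta')\, P(\theta' \mid X_1, \ldots, X_N)}$$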

The posterior probabilities from row N are used as priors in row N+1. This is implemented sequentially for all rows 1...N...M. Finally, the maximum a posteriori probability (MAP) estimate is taken to be the delimiter.
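The sequential update can be illustrated with the data from the Example section. The sketch below is not taken from the package: sequential_posterior and column_ratio are hypothetical names, the likelihood (smaller column count over larger) is an assumed reading of the description, and the prior is entered by hand as the relative frequencies of the two candidates in the header.

```python
def column_ratio(header, row, delim):
    """Assumed likelihood: smaller column count over larger column count."""
    n_header = len(header.split(delim))
    n_row = len(row.split(delim))
    return min(n_header, n_row) / max(n_header, n_row)


def sequential_posterior(prior, header, rows):
    """Feed the posterior from each row in as the prior for the next row."""
    posterior = dict(prior)
    for row in rows:
        unnormalized = {
            d: column_ratio(header, row, d) * p for d, p in posterior.items()
        }
        total = sum(unnormalized.values())
        posterior = {d: v / total for d, v in unnormalized.items()}
    return posterior


header = "c_1~|~c_2~|~c_3"
rows = ["1~|~2~|~3", "4~|~~|~", "5~|~~|~6"]
prior = {"_": 0.6, "~|~": 0.4}  # '_' appears 3 times, '~|~' twice

posterior = sequential_posterior(prior, header, rows)
map_delim = max(posterior, key=posterior.get)  # the MAP estimate
print(map_delim)  # ~|~
print({d: round(p, 4) for d, p in posterior.items()})
```

Under these assumptions the posterior for "~|~" comes out near 0.977 and the posterior for "_" near 0.023, in line with the output shown in the Example section.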

Hypothesis Testing

A Bayesian hypothesis test is used to evaluate the significance of the most likely delimiter. The framework for this hypothesis test is as follows: hypothesis one is that the delimiter with the highest posterior probability (the MAP estimate) is the true delimiter, and hypothesis two is that the delimiter with the second highest posterior probability is the true delimiter. The more likely hypothesis one is relative to hypothesis two, the more confident we are in the model's choice of most likely delimiter.

To conduct this hypothesis test, we calculate the Bayes factor, which is the ratio of the likelihoods of the two hypotheses.

The following rules are used to determine the significance of the results given the Bayes factor:

1.) Bayes factor = 1: no evidence.
2.) 1 < Bayes factor < 3: weak evidence.
3.) 3 < Bayes factor < 10: substantial evidence.
4.) Bayes factor > 10: strong evidence.
5.) Bayes factor < 1: not possible in this hypothesis test.

Source: Jeffreys, Harold (1998) [1961]. The Theory of Probability (3rd ed.). Oxford, England. p. 432.

DelimiterFinder will raise a warning if the Bayes factor for the chosen delimiter is less than 3. Increasing the number of rows or adding unwanted characters to the ignore_chars list will generally increase the Bayes factor.
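The interpretation rules and the warning threshold can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: bayes_factor and evidence_label are hypothetical functions, the Bayes factor is assumed to be the ratio of the two largest posterior probabilities, and the boundary values 3 and 10, which the rules above leave unassigned, are placed in the higher category.

```python
import warnings


def bayes_factor(posterior):
    """Assumed definition: top posterior divided by the runner-up."""
    ranked = sorted(posterior.values(), reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return float("inf")  # no competing candidate
    return ranked[0] / ranked[1]


def evidence_label(bf):
    """Map a Bayes factor onto the evidence categories listed above."""
    if bf < 1:
        raise ValueError("not possible when comparing the top two candidates")
    if bf == 1:
        return "no evidence"
    if bf < 3:
        return "weak evidence"
    if bf < 10:  # assumption: exactly 3 counts as substantial
        return "substantial evidence"
    return "strong evidence"  # assumption: exactly 10 counts as strong


posterior = {"_": 0.0229, "~|~": 0.9771}
bf = bayes_factor(posterior)
if bf < 3:
    # Mirrors the documented behavior of warning below a Bayes factor of 3.
    warnings.warn("weak evidence for the chosen delimiter")
print(evidence_label(bf))  # strong evidence
```

With the posterior values from the earlier example, this ratio lands between 42 and 43, roughly consistent with the bayes_factor of 42.66 reported there.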
