Python package for probabilistic delimiter detection.
DelimiterFinder
DelimiterFinder is a Python package for probabilistic delimiter detection. It is a fast, efficient, and easy-to-use tool for identifying unknown delimiters within tabular data.
Key Features
- Versatile: Detection of both single and multiple character delimiters.
- Versatile: Supports tabular data stored in a variety of formats, including common tabular data format files (e.g., CSV, TSV, TXT) or Python string and list types.
- Robustness: Leverages Bayesian techniques to probabilistically identify unknown delimiters given data.
- Robustness: Includes significance testing for all results.
- Robustness: Robust to malformed data (not an "all or nothing approach" in the case of malformed rows).
- Transparency: Reports posterior probabilities for all identified candidate delimiters.
- Fast and efficient: Detect delimiters with a high level of confidence given just 10 rows.
Installation
Install the latest released version from PyPI.
pip install DelimiterFinder
User Guide
Parameters and methods for DelimiterFinder.finder.Finder
class DelimiterFinder.finder.Finder(ignore_chars=None)
Parameter | Type | Default | Optional | Description |
---|---|---|---|---|
ignore_chars | list | None | Yes | List of non-alphanumeric characters which should not be considered candidate delimiters. |
Attributes | Type | Description |
---|---|---|
posterior | dict | The posterior probability of each candidate delimiter. |
bayes_factor | float | Evidence in favor of the most likely delimiter (MAP) relative to the second most likely delimiter. |
Methods:
find(data, is_path=False, num_samples=20, new_line_sep="\n")
Parameter | Type | Default | Optional | Description |
---|---|---|---|---|
data | str or list | | No | The input data, either as a single string with each row separated by new_line_sep or as a list where each element is a row. Alternatively, a path to a text file (e.g., .TXT, .CSV) may be passed, in which case the is_path parameter should be set to True. |
is_path | bool | False | Yes | An indicator for whether the value passed to the data parameter is a file path. |
num_samples | int | 20 | Yes | Number of rows to sample for inference. |
new_line_sep | str | "\n" | Yes | The new line separator for the rows in the data. |
Return | Type | Description |
---|---|---|
delim | str | The maximum a posteriori probability (MAP) estimate. |
Example
Using DelimiterFinder is easy. To get started, simply create an instance of the Finder class and pass your data to the find method. The example below walks through a simple implementation.
>>> from DelimiterFinder.finder import Finder
>>> # example data
>>> data = "c_1~|~c_2~|~c_3\n1~|~2~|~3\n4~|~~|~\n5~|~~|~6"
>>> # create instance of Finder and fit to data
>>> delim_locator = Finder()
>>> delim = delim_locator.find(data)
>>> # check the most likely delimiter
>>> print(delim)
~|~
>>> # check the probabilities for each delimiter
>>> print(delim_locator.posterior)
{'_': 0.022, '~|~': 0.977}
>>> # check the results of the significance test
>>> print(delim_locator.bayes_factor)
42.66
As we can see from the output above, DelimiterFinder was able to identify an unknown three-character delimiter. The posterior attribute provides a dictionary with all of the tested candidate delimiters and their associated posterior probabilities. The bayes_factor attribute shows us that there is very strong evidence (i.e., a value greater than 10) in favor of the most likely delimiter relative to the second most likely delimiter. All with just 4 rows of data!
Indeed, DelimiterFinder can handle much more complicated data than the example given above, and confidence in its decision increases with the number of rows provided. DelimiterFinder has been tested for robustness against hundreds of randomly generated test cases. These tests can be found in the tests directory of the GitHub repo.
Bayesian Methods
Inference
DelimiterFinder leverages Bayesian techniques to probabilistically identify unknown delimiters given data. In particular, DelimiterFinder fits a model using sequential Bayesian updating.
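The model is given as follows:

$$P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{\sum_{\theta'} P(X \mid \theta')\, P(\theta')}$$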
Here, theta ranges over a finite set of candidate delimiters. Candidate delimiters are all contiguous strings of valid (i.e., not in the given ignore_chars list) non-alphanumeric characters in the first row of data (assumed to be the header). The prior for these candidate delimiters is given by their relative frequencies. The variable X represents a row of data. The likelihood is the proportion between the number of columns in the header and the number of columns in the given row of data, assuming delimiter theta is the true delimiter. Since this is a discrete distribution with a finite number of candidate delimiters, the denominator (normalization constant) is the sum over all thetas of the likelihood times the prior.
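The candidate and prior construction described above can be sketched in a few lines. This is a minimal illustration, not the package's own code: the function name candidate_priors is hypothetical, and the sketch assumes that alphanumeric and ignored characters both end a run and that only maximal runs count as candidates.

```python
from collections import Counter


def candidate_priors(header, ignore_chars=None):
    """Return {candidate: prior} from a header row (hypothetical helper)."""
    ignored = set(ignore_chars or [])
    runs, current = [], ""
    for ch in header:
        # Assumption: alphanumeric and ignored characters both break a run.
        if ch.isalnum() or ch in ignored:
            if current:
                runs.append(current)
            current = ""
        else:
            current += ch
    if current:
        runs.append(current)

    # The prior of each candidate is its relative frequency among all runs.
    counts = Counter(runs)
    total = sum(counts.values())
    return {cand: n / total for cand, n in counts.items()}


print(candidate_priors("c_1~|~c_2~|~c_3"))
print(candidate_priors("c_1~|~c_2~|~c_3", ignore_chars=["_"]))
```

For the header used in the Example section, "_" appears three times and "~|~" twice, so the sketch assigns them priors of 0.6 and 0.4. Ignoring "_" leaves "~|~" as the only candidate.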
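The model is updated sequentially over M rows of data as follows:

$$P(\theta \mid X_1, \dots, X_{N+1}) = \frac{P(X_{N+1} \mid \theta)\, P(\theta \mid X_1, \dots, X_N)}{\sum_{\theta'} P(X_{N+1} \mid \theta')\, P(\theta' \mid X_1, \dots, X_N)}$$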
The posterior probabilities from row N are used as priors in row N+1. This is implemented sequentially for all rows 1...N...M. Finally, the maximum a posteriori probability (MAP) estimate is taken to be the delimiter.
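The sequential update can be sketched as below. This is an illustration under stated assumptions rather than the package's implementation: the helper names are hypothetical, the likelihood is assumed to be the smaller column count divided by the larger one, and the priors are hard-coded as the relative frequencies of the two candidates in the example header.

```python
def likelihood(header, row, delim):
    """Assumed likelihood: ratio of the smaller to the larger column count."""
    h = len(header.split(delim))
    r = len(row.split(delim))
    return min(h, r) / max(h, r)


def sequential_update(rows, priors):
    """Update candidate probabilities one row at a time (hypothetical helper)."""
    header = rows[0]
    posterior = dict(priors)
    for row in rows[1:]:
        # The posterior from the previous row acts as the prior for this row.
        unnorm = {d: likelihood(header, row, d) * p for d, p in posterior.items()}
        z = sum(unnorm.values())  # normalization constant
        posterior = {d: v / z for d, v in unnorm.items()}
    return posterior


rows = "c_1~|~c_2~|~c_3\n1~|~2~|~3\n4~|~~|~\n5~|~~|~6".split("\n")
priors = {"_": 0.6, "~|~": 0.4}  # relative frequencies in the header row

posterior = sequential_update(rows, priors)
map_estimate = max(posterior, key=posterior.get)
print(map_estimate, posterior)
```

On the four example rows, this sketch puts a posterior of roughly 0.977 on "~|~", in line with the output shown in the Example section.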
Hypothesis Testing
A Bayesian hypothesis test is used to evaluate the significance of the most likely delimiter. The framework for this hypothesis test is as follows: hypothesis one is that the delimiter with the highest posterior probability (MAP estimate) is the true delimiter, and hypothesis two is that the delimiter with the second highest posterior probability is the true delimiter. The more likely hypothesis one is than hypothesis two, the more confident we are with the model's choice for most likely delimiter.
To conduct this hypothesis test, we calculate the Bayes factor, which is the ratio of the likelihoods of the two hypotheses.
The following rules are used to determine the significance of the results given the Bayes factor:
1.) Bayes factor = 1: no evidence.
2.) 1 < Bayes factor < 3: weak evidence.
3.) 3 < Bayes factor < 10: substantial evidence.
4.) Bayes factor > 10: strong evidence.
5.) Bayes factor < 1: not possible in this hypothesis test.
Source: Jeffreys, Harold (1998) [1961]. The Theory of Probability (3rd ed.). Oxford, England. p. 432.
DelimiterFinder will raise a warning if the Bayes factor for the chosen delimiter is less than 3. Increasing the number of rows or adding unwanted characters to the ignore_chars list will generally increase the Bayes factor.
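The interpretation rules above can be expressed as a small helper, for example when logging results from many files. This is a sketch, not part of the package: the function name is hypothetical, and because the rules use strict inequalities, it assumes that values of exactly 3 or 10 fall into the "substantial evidence" band.

```python
import warnings


def interpret_bayes_factor(bf):
    """Map a Bayes factor to the evidence labels listed above (hypothetical helper)."""
    if bf < 1:
        # The MAP estimate is never less likely than the runner-up.
        raise ValueError("Bayes factor below 1 is not possible in this test")
    if bf == 1:
        return "no evidence"
    if bf < 3:
        return "weak evidence"
    if bf <= 10:  # assumption: the boundaries 3 and 10 count as substantial
        return "substantial evidence"
    return "strong evidence"


def check_bayes_factor(bf):
    """Warn when the evidence is below the threshold of 3 mentioned above."""
    label = interpret_bayes_factor(bf)
    if bf < 3:
        warnings.warn(f"Bayes factor {bf:.2f}: {label} for the chosen delimiter")
    return label


print(check_bayes_factor(42.66))
```

Applied to the value from the Example section, the helper reports strong evidence and raises no warning.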