Library for pattern and anomalous pattern detection

License: BSD

About

This package identifies patterns in data and creates Openclean Patterns from them. It is part of the openclean-core library, where it is used to create profiling results and to detect anomalies. Openclean Patterns currently support the following data types and can be extended with other basic or non-basic implementations:

  • Basic
    • String

    • Integers

    • Punctuations

    • Spaces

  • Non-Basic/Advanced
    • Dates
      • days of the week and months

    • Business Entities
      • using corporation suffixes

    • Geospatial Entities
      • using datamart-geo for administrative levels (in progress)

    • Address
      • USPS street abbreviations and secondary unit designators for addresses

The package has been extended to identify anomalous patterns inside the data as well.

Installation

Install openclean-pattern from the Python Package Index (PyPI) using pip with:

pip install openclean-pattern

Usage

The library comes with many predefined classes that support the pattern detection process. You can use the OpencleanPatternFinder class directly; otherwise, the general process looks similar to the following:

  1. Sample the column
    For very large datasets, the following samplers are provided to help extract the distribution of the column:
    • RandomSampler: considers each item in the iterable equally probable to get selected

    • WeightedRandomSampler: takes a Counter of type {value:frequency} and creates a sample using the Counter distribution.

    • Distinct: selects only distinct rows

  2. Tokenize it to remove punctuation
    TypeResolvers can also be injected at this point, so that tokenizing and encoding happen in the same run instead of running type resolution as a separate step 3:
    • RegexTokenizer: tokenizes using a default regex (or a custom regex provided by the user) that breaks each row value into a list of tokens while keeping the delimiters intact. It also converts the tokens to lower case. The user can choose whether a string such as ‘a.b.c’ is treated as delimited by the ‘.’ character, or whether ‘.’ is treated as an abbreviation character so that ‘abc’ is kept intact.

    • DefaultTokenizer: follows the RegexTokenizer process and then uses the DefaultTypeResolver to resolve token types.

  3. Resolve Types
    This stage converts the tokens to their Basic and Non-Basic representations:
    • BasicTypeResolver: converts the row into the Basic types listed above.

    • AdvancedTypeResolver: has numerous implementations and can be easily extended to add new AdvancedTypeResolver classes.
      • DateResolver

      • BusinessEntityResolver

      • AddressDesignatorResolver

      • GeoSpatialResolver

    • DefaultTypeResolver: does both Basic and Non-Basic type resolution by letting a user add Non-Basic interceptors before the Basic type resolution operation.

  4. Collect and/or Align
    Create groups of similar rows and align them:
    • Cluster: collects similar tokenized rows by clustering them with DBSCAN over a precomputed distance.

    • Group: groups tokenized rows of similar length.

    • CombAlign [1]: looks at all possible combinations of each token in each row with all other rows, calculates the distance, clusters the closest alignments together using DBSCAN, and returns the clustered groups.

  5. Compile a pattern
    Generate a regex pattern from the aligned groups:
    • DefaultRegexCompiler: iterates through each row and analyzes, for each token position, the different data types that appear at that position. It then selects the majority type as the pattern for that position. Combining the positional regexes yields a full expression for the column.
      • method=col: compiles the pattern based on the positions of the tokens in each row. It flags values that don’t match the majority type at their position as anomalies.

      • method=row: compiles the pattern using each full row as a possible pattern.
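
The sampling and tokenizing steps above can be sketched in plain Python. This is a minimal illustration using only the standard library, not openclean-pattern's own API: the function names random_sample, weighted_sample, distinct and tokenize, the abbreviations flag, and the splitting regex are all assumptions made for this example.

```python
import random
import re
from collections import Counter


def random_sample(values, n, seed=None):
    """Uniform sample without replacement: every item is equally likely."""
    rng = random.Random(seed)
    values = list(values)
    return rng.sample(values, min(n, len(values)))


def weighted_sample(counts, n, seed=None):
    """Sample with replacement, proportional to a {value: frequency} Counter."""
    rng = random.Random(seed)
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=n)


def distinct(values):
    """Keep the first occurrence of each value, preserving order."""
    return list(dict.fromkeys(values))


def tokenize(value, abbreviations=False):
    """Lowercase the value and split it into tokens, keeping delimiters.

    With abbreviations=True (hypothetical flag), '.' is dropped before
    splitting, so 'a.b.c' stays together as 'abc'.
    """
    text = value.lower()
    if abbreviations:
        text = text.replace(".", "")
    # The capturing group makes re.split return the delimiters as tokens too.
    return [token for token in re.split(r"(\W)", text) if token]


column = ["12 Main St.", "7 Oak Ave.", "12 Main St.", "A.B.C"]

print(distinct(column))
print(random_sample(column, 2, seed=0))
print(weighted_sample(Counter(column), 3, seed=0))
print(tokenize("12 Main St."))  # ['12', ' ', 'main', ' ', 'st', '.']
print(tokenize("A.B.C"))  # ['a', '.', 'b', '.', 'c']
print(tokenize("A.B.C", abbreviations=True))  # ['abc']
```

Sampling from the Counter keeps frequent values frequent in the sample, whereas the distinct view discards frequency information entirely; which one suits a column depends on whether rare formats should carry as much weight as common ones.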

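Type resolution, grouping, and column-wise compilation can be sketched in the same spirit. Again this is a standalone illustration rather than the library's implementation: the type labels, the STREET_DESIGNATORS lookup, and the functions resolve_type, group_by_length and compile_column_pattern are hypothetical names, and grouping by token count stands in for the DBSCAN-based clustering and alignment options.

```python
from collections import Counter, defaultdict

# Hypothetical lookup standing in for one non-basic (address) resolver.
STREET_DESIGNATORS = {"st", "ave", "rd", "blvd"}


def resolve_type(token):
    """Map a token to a type label, trying the non-basic lookup first."""
    if token in STREET_DESIGNATORS:
        return "ADDRESS"
    if token.isdigit():
        return "INTEGER"
    if token.isalpha():
        return "STRING"
    if token.isalnum():
        return "ALPHANUM"
    if token.isspace():
        return "SPACE"
    return "PUNCTUATION"


def group_by_length(rows):
    """Group tokenized rows by their number of tokens."""
    groups = defaultdict(list)
    for row in rows:
        groups[len(row)].append(row)
    return dict(groups)


def compile_column_pattern(rows):
    """Return the majority type per position and the deviating row indices.

    All rows are assumed to have the same number of tokens.
    """
    typed = [[resolve_type(token) for token in row] for row in rows]
    # zip(*typed) walks the rows position by position.
    pattern = [Counter(column).most_common(1)[0][0] for column in zip(*typed)]
    anomalies = [i for i, row in enumerate(typed) if row != pattern]
    return pattern, anomalies


tokenized = [
    ["12", " ", "main", " ", "st"],
    ["7", " ", "oak", " ", "ave"],
    ["apt", " ", "elm", " ", "rd"],
    ["po", " ", "box"],
]

groups = group_by_length(tokenized)
pattern, anomalies = compile_column_pattern(groups[5])
print(pattern)  # ['INTEGER', 'SPACE', 'STRING', 'SPACE', 'ADDRESS']
print(anomalies)  # [2]
```

The third row is flagged because its first token is a string where the majority of rows in the group have an integer, which mirrors the idea of flagging values that do not match the majority type at a position.
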
Upcoming Modules

  • serializer / deserializer

  • multiple sequence alignment

Examples

We include several notebooks in this repository that demonstrate openclean-pattern’s usage.
