Library for clustering entities based on numeric or text value.

pygrouplib

Python lightweight library for entities clustering

Quick start

from pygrouplib import NumericGrouper, TextGrouper

# Example data list
employees = []
employees.append({'name':'John','title':'Cardiologist','age':46})
employees.append({'name':'Ryan','title':'Cardiology','age':34})
employees.append({'name':'Kate','title':'Child Cardiologist', 'age':56})
employees.append({'name':'Anna','title':'Neurology', 'age':33})
employees.append({'name':'Mike','title':'Neurologist', 'age':38})

# Group by title, ignoring "Child" and allowing 1 different character for each 5 characters in title.
tg = TextGrouper()
groups = tg.group(employees, key=lambda x:x['title'], chars_per_error=5, ignore_list=['Child'])
print(*groups, sep='\n')

''' 
[{'name': 'John', 'title': 'Cardiologist', 'age': 46}, {'name': 'Ryan', 'title': 'Cardiology', 'age': 34}, {'name': 'Kate', 'title': 'Child Cardiologist', 'age': 56}]
[{'name': 'Mike', 'title': 'Neurologist', 'age': 38}, {'name': 'Anna', 'title': 'Neurology', 'age': 33}]
'''

# Group by age into 3 subgroups
ng = NumericGrouper()
groups = ng.group(employees, key=lambda x:x['age'], groups=3)
print(*groups, sep="\n")

'''
[{'name': 'Anna', 'title': 'Neurology', 'age': 33}, {'name': 'Ryan', 'title': 'Cardiology', 'age': 34}, {'name': 'Mike', 'title': 'Neurologist', 'age': 38}]
[{'name': 'John', 'title': 'Cardiologist', 'age': 46}]
[{'name': 'Kate', 'title': 'Child Cardiologist', 'age': 56}]
'''

Installation

pygrouplib is published on PyPI, so you can install it with easy_install or pip. The package name is pygrouplib, and the same package works on Python 2 and Python 3. Make sure you use the pip or easy_install that matches your Python version (these may be named pip3 and easy_install3 if you’re using Python 3).

$ easy_install pygrouplib
$ pip install pygrouplib

Documentation

NumericGrouper

group()

  • Groups elements into subgroups based on numeric value.
  • Arguments:
    • entities - List of entities to be divided into groups.
    • groups - Number of resulting subgroups. The default value is None, in which case it is calculated based on provided values from entities.
    • key - Function of one argument that is used to extract a comparison key from each element in the iterable (for example, key=lambda x: x['value']). The default value is None (compare the elements directly).
  • Returns a list of entities grouped into lists.
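
The description above does not say how the group boundaries are chosen. As a rough illustration only, one simple strategy is to sort the values and cut the sorted list at the widest gaps. The function name split_at_largest_gaps is hypothetical and this is not necessarily the algorithm pygrouplib uses, although it happens to reproduce the Quick start output for the ages.

```python
def split_at_largest_gaps(entities, groups, key=None):
    """Sort entities by numeric key and cut the sorted list at the widest gaps.

    Illustrative sketch only; not taken from pygrouplib.
    """
    key = key or (lambda x: x)
    ordered = sorted(entities, key=key)
    if groups <= 1 or len(ordered) <= 1:
        return [ordered]
    # gaps[i] is the distance between neighbours i and i + 1 in sorted order.
    gaps = [key(b) - key(a) for a, b in zip(ordered, ordered[1:])]
    # Indices of the (groups - 1) widest gaps, put back in left-to-right order.
    widest = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    cuts = sorted(widest[:groups - 1])
    result, start = [], 0
    for cut in cuts:
        result.append(ordered[start:cut + 1])
        start = cut + 1
    result.append(ordered[start:])
    return result


ages = [46, 34, 56, 33, 38]
print(split_at_largest_gaps(ages, groups=3))
# [[33, 34, 38], [46], [56]]
```

Cutting at gaps keeps close values together, but it is only one of several reasonable choices; the library may split differently on other data.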

TextGrouper

group()

  • Groups elements into subgroups based on text value. Similarity is calculated using the Levenshtein algorithm.
  • Arguments:
    • entities - List of elements to be divided into groups.
    • similarity_limit - Maximum Levenshtein distance at which two words are still considered similar. The default value is calculated as 1 + (1 for each chars_per_error characters).
    • chars_per_error - Number of characters per allowed error, where the Levenshtein distance counts as the number of errors. The default value is 8 (1 error is allowed for each 8 characters).
    • ignore_list - List of patterns to ignore when calculating text similarity. For example, with ignore_list=['\\d'], 'word123' and '123word45' are considered equal.
    • key - Function of one argument that is used to extract comparison key from each element in iterable (for example, key=str.lower). The default value is None (compare the elements directly).
  • Returns a list of entities grouped into lists.
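
The default similarity_limit arithmetic can be made concrete with a small sketch. It rests on two assumptions the description does not settle: that the character count is the length of the longer string, and that "1 for each chars_per_error characters" means integer division. The function name default_similarity_limit is hypothetical, not part of pygrouplib.

```python
def default_similarity_limit(s1, s2, chars_per_error=8):
    """Return 1 + one allowed error per chars_per_error characters.

    Assumption: the character count is the length of the longer string.
    """
    longest = max(len(s1), len(s2))
    return 1 + longest // chars_per_error


# 'Neurologist' has 11 characters: 1 + 11 // 5 = 3 allowed edits.
print(default_similarity_limit('Neurology', 'Neurologist', chars_per_error=5))  # 3
# With the default of 8 characters per error: 1 + 11 // 8 = 2.
print(default_similarity_limit('Neurology', 'Neurologist'))  # 2
```

Under these assumptions, 'Neurology' and 'Neurologist' (distance 3) fall within the limit when chars_per_error=5, which is consistent with the Quick start grouping.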

levenshtein_distance()

  • Calculates the Levenshtein distance between two strings: the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. Comparison is case-sensitive.
  • Arguments:
    • s1, s2 - Strings to be compared. Leading and trailing spaces are ignored.
    • ignore_list - List of patterns to be ignored when comparing strings. For example, with ignore_list=['\\d'], the distance between 'word123' and '123word45' is 0. The default value is an empty list.
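
For reference, the behaviour described above can be sketched with the standard dynamic-programming formulation of Levenshtein distance. This is a standalone illustration, not pygrouplib's source; the name levenshtein_sketch is hypothetical, and it assumes ignore_list entries are regular expressions removed with re.sub before the surrounding spaces are stripped.

```python
import re


def levenshtein_sketch(s1, s2, ignore_list=None):
    """Case-sensitive edit distance after removing ignored patterns."""
    # Assumption: each ignore_list entry is a regular expression to delete.
    for pattern in ignore_list or []:
        s1 = re.sub(pattern, '', s1)
        s2 = re.sub(pattern, '', s2)
    s1, s2 = s1.strip(), s2.strip()

    # previous[j] holds the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein_sketch('kitten', 'sitting'))                          # 3
print(levenshtein_sketch('word123', '123word45', ignore_list=[r'\d']))  # 0
print(levenshtein_sketch('Cardiologist', 'Cardiology'))                 # 3
```

Keeping only two rows of the table holds memory proportional to the length of s2 rather than to the product of both lengths.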
