
FuzzTypes is a Pydantic extension for annotating autocorrecting fields


FuzzTypes

FuzzTypes is a set of "autocorrecting" annotation types that expand upon Pydantic's built-in data conversions. Designed for simplicity, it provides powerful normalization capabilities (e.g. named entity linking) to ensure structured data is composed of "smart things", not "dumb strings".

Basic Use Case

TODO: compare and contrast with Pydantic's default data conversion.

Structured Data Generation Use Case

Several libraries (e.g. Instructor, Outlines, Marvin) use Pydantic to define models for structured data generation with Large Language Models (LLMs), either via function calling or via grammar- or regex-constrained sampling driven by the JSON schema that Pydantic generates.

This approach lets you enumerate the allowed values with Python's Literal or Enum, or with JSON Schema's examples field, directly in your Pydantic class declaration; the LLM then uses them to generate valid values. It works exceptionally well for low-cardinality fields (few unique allowed values), such as the world's continents (7 in total).
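As a minimal sketch of the low-cardinality case, the snippet below enumerates the continents with Literal and builds the corresponding JSON Schema fragment by hand using only the standard library. The names Continent and continent_schema are illustrative; in practice Pydantic generates a comparable "enum" entry from a Literal field.

```python
import json
from typing import Literal, get_args

# Low-cardinality field: every allowed value fits comfortably in the schema.
Continent = Literal[
    "Africa", "Antarctica", "Asia", "Europe",
    "North America", "Oceania", "South America",
]

# Hand-built JSON Schema fragment (an assumption for illustration);
# the allowed values are read back from the Literal type itself.
continent_schema = {"type": "string", "enum": list(get_args(Continent))}

print(json.dumps(continent_schema, indent=2))
```

All seven values travel with the schema, which is what makes this workable for an LLM and impractical once the list grows to millions of entries.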

This approach, however, doesn't scale well to high-cardinality fields (many unique allowed values), such as known human genomic variants (~325M). Where exactly the cutoff lies between "low" and "high" cardinality is an exercise left to the reader and their use case.

That's where FuzzTypes comes in. The FuzzTypes annotations manage the allowed values and resolve them during the Pydantic validation process.
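The idea of resolving values at validation time can be sketched without the FuzzTypes API, which this section does not show. In the standard-library example below, a dataclass's `__post_init__` stands in for a Pydantic "before" validator; CANONICAL, resolve_country and Address are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical lookup table: lowercased name or alias -> canonical name.
# The allowed values live here, outside the model's schema.
CANONICAL = {
    "united states": "United States",
    "usa": "United States",
    "germany": "Germany",
    "deutschland": "Germany",
}


def resolve_country(value: str) -> str:
    """Map a raw string to its canonical name, or raise if unknown."""
    key = value.strip().lower()
    if key not in CANONICAL:
        raise ValueError(f"unknown country: {value!r}")
    return CANONICAL[key]


@dataclass
class Address:
    city: str
    country: str

    def __post_init__(self) -> None:
        # Stand-in for a "before" validator: normalize on construction.
        self.country = resolve_country(self.country)


address = Address(city="Berlin", country="Deutschland")
print(address.country)  # Germany
```

The model declaration stays small no matter how many entries the lookup table holds, because the table is consulted only when a value is validated.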

Base Types

| Type | Description |
|---|---|
| Alias | Match by name or alias. |
| Function | Match by calling a custom function. |
| Fuzz | Match by name or alias via fuzzy string similarity using RapidFuzz. |
| Hybrid | Match by name or alias via reciprocal rank fusion of semantic and fuzzy similarity. |
| Name | Match by name only. |
| Regex | Match by regular expression pattern using re standard library. |
| Semantic | Match by name or alias via vector-based semantic similarity using PyNNDescent. |
| Typeahead | Match by name or alias prefix via Trie lookups with fuzzy or semantic fallback. |
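Reciprocal rank fusion, mentioned for the Hybrid type, combines several ranked result lists by summing 1 / (k + rank) for each candidate. The sketch below is a generic version of that technique, not FuzzTypes' implementation; the function name, the sample rankings, and the constant k = 60 (a commonly used default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a candidate."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            scores[name] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical results from a fuzzy matcher and a semantic matcher.
fuzzy_ranking = ["Apricot", "Apple", "Grape"]
semantic_ranking = ["Apricot", "Peach", "Apple"]

fused = reciprocal_rank_fusion([fuzzy_ranking, semantic_ranking])
print(fused)  # ['Apricot', 'Apple', 'Peach', 'Grape']
```

A candidate that appears in both lists, even at middling ranks, outscores one that appears high in only a single list, which is why the fusion favors results both matchers agree on.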

Usable Types

| Type | Description |
|---|---|
| ASCII | Convert Unicode string to ASCII equivalent using anyascii. |
| Airport | Represents airport names (e.g., O'Hare International Airport) for detailed aviation-related data. |
| AirportCode | Manages airport codes (e.g., ORD) for quick and standardized airport identification. |
| CleanURL | Normalized URL with trackers removed using url-normalize. |
| Country | Represents country names, such as Germany or United States, for standardized country identification. |
| CountryCode | Handles ISO country codes (e.g., DE, GB, US) for concise representation of countries. |
| Currency | Handles currency codes (e.g., USD) for financial transactions and currency representation. |
| Date | Convert date strings to Date object using DateParser. |
| Email | Regex for extracting a single valid email from a string. |
| Emoji | Matches emojis based on Unicode Consortium aliases. Utilizes the Emoji project for matching. |
| Integer | Convert number or ordinal text to an int using NumberParser. |
| Language | Manages full language names (e.g., English, German) for clear language specification. |
| LanguageCode | Deals with ISO language codes (e.g., en, de) for brief language identification. |
| Person | Parse human name into subfields (e.g. first, last, suffix) using python-nameparser. |
| Quantity | Converts strings to Quantity objects, combining value and unit of measurement, via Pint. |
| SSN | Regex for extracting a single social security number from a string. |
| Time | Convert date time strings to DateTime object using DateParser. |
| USState | Represents U.S. state names (e.g., Ohio) for detailed geographical categorization within the United States. |
| USStateCode | Manages U.S. state codes (e.g., OH) for abbreviated state representation. |
| Zipcode | Regex for extracting a 5 or 9 digit zipcode from a string. |
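To illustrate the regex-based types, here is a rough sketch of extracting a 5 or 9 digit zipcode from free text with the standard re module. The pattern and the extract_zipcode helper are assumptions made for this example; the pattern FuzzTypes actually uses may differ.

```python
import re
from typing import Optional

# Assumed pattern: five digits, optionally followed by "-" and four more.
ZIPCODE_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def extract_zipcode(text: str) -> Optional[str]:
    """Return the first zipcode found in the text, or None if absent."""
    match = ZIPCODE_PATTERN.search(text)
    return match.group(0) if match else None


print(extract_zipcode("Ship to Chicago, IL 60614-1234"))
print(extract_zipcode("Ship to 12345 please"))
print(extract_zipcode("no zipcode here"))
```

The word boundaries keep the pattern from matching five digits taken out of a longer number such as 123456.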

Common Arguments

| Argument | Type | Description |
|---|---|---|
| case_sensitive | bool | If False, matches regardless of case. If True, matches only if case is exact. Default False. |
| examples | list | Example values used in schema generation. |
| notfound_mode | Literal | raise: Raises an error if key not found. none: Returns None if key not found. allow: Returns key if not found. |
| tiebreaker_mode | Literal | raise: Raises error if tied (value, priority). lesser: Returns lower value answer. greater: Returns greater value answer. |
| validator_mode | str | before: Resolves value before validation. Currently the only tested option. |
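The behavior described for case_sensitive and notfound_mode can be modeled with a small standalone function. This is a sketch of the documented semantics only, not FuzzTypes code; lookup and STATES are hypothetical names, and the other arguments are not modeled.

```python
from typing import Dict, Literal, Optional

NotFoundMode = Literal["raise", "none", "allow"]

# Hypothetical mapping of keys to resolved values.
STATES: Dict[str, str] = {"OH": "Ohio", "NY": "New York"}


def lookup(
    key: str,
    mapping: Dict[str, str],
    case_sensitive: bool = False,
    notfound_mode: NotFoundMode = "raise",
) -> Optional[str]:
    """Resolve key against mapping following the documented modes."""
    if case_sensitive:
        found = mapping.get(key)
    else:
        # Compare case-insensitively by folding both sides.
        folded = {name.casefold(): value for name, value in mapping.items()}
        found = folded.get(key.casefold())
    if found is not None:
        return found
    if notfound_mode == "raise":
        raise KeyError(f"key not found: {key!r}")
    if notfound_mode == "none":
        return None
    if notfound_mode == "allow":
        return key  # pass the unmatched input through unchanged
    raise ValueError(f"unsupported notfound_mode: {notfound_mode!r}")


print(lookup("oh", STATES))                          # Ohio
print(lookup("oh", STATES, True, "none"))            # None
print(lookup("Texas", STATES, notfound_mode="allow"))  # Texas
```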

Lazy Dependencies

FuzzTypes leverages several powerful libraries to extend its functionality.

These dependencies are not installed by default with FuzzTypes to keep the installation lightweight. Instead, they are optional and can be installed as needed depending on which types you use.

Below is a list of these dependencies, including their licenses and which types require them.

| Type | Dependency | License | Usage |
|---|---|---|---|
| ASCII | anyascii | ISC | An alternative to unidecode for Unicode to ASCII conversion, offering extensive character mapping. |
| ASCII | unidecode | GPL | Converts Unicode strings to their ASCII equivalents, providing broad character support with minimal size. |
| Date | dateparser | BSD-3 | Parses date strings in almost any string format to Date objects, supporting multiple locales. |
| Emoji | emoji | BSD | Matches emojis based on Unicode Consortium aliases, enhancing text processing with emoji support. |
| Fuzz | rapidfuzz | MIT | Performs fuzzy string matching to find close matches to names or aliases with high performance. |
| Integer | number-parser | BSD-3 | Converts number or ordinal text to integers, handling both written and numerical forms. |
| Person | nameparser | LGPL | Parses human names into subfields (e.g., first, last, suffix), aiding in structured name handling. |
| Semantic | pynndescent | MIT | Fast Approximate Nearest Neighbors library for retrieving similar text. |
| Semantic | sentence-transformers | MIT | Default embedding library for encoding text into dense vector embeddings. |
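One common way to keep dependencies optional is to import them only when a type that needs them is first used. The helper below is a generic sketch of that pattern using importlib; lazy_import is a hypothetical name, and how FuzzTypes actually loads its dependencies is not stated here.

```python
import importlib
from types import ModuleType


def lazy_import(module_name: str, package_name: str = "") -> ModuleType:
    """Import a module on demand, with an install hint if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # The pip package name can differ from the importable module name.
        install_name = package_name or module_name
        raise ImportError(
            f"{module_name!r} is required for this type; "
            f"install it with `pip install {install_name}`"
        ) from exc


# Demonstrated with a standard-library module so the sketch runs anywhere.
json_module = lazy_import("json")
print(json_module.dumps([1, 2, 3]))
```

A type that depends on, say, rapidfuzz would call such a helper inside its matching code rather than at module import time, so users who never touch that type never need the package.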
