Pricetag

A pure-Python library for extracting price and currency information from unstructured text, with a primary focus on salary data in job postings.

Features

  • Zero Dependencies: Uses only Python standard library
  • High Performance: Processes 1000+ documents per second
  • Comprehensive Pattern Support:
    • Individual amounts: $50,000, $50k, 50k USD
    • Ranges: $50k-$75k, $50,000 to $75,000
    • Hourly/Annual/Monthly rates: $25/hour, $50k/year, $4k/month
    • Contextual terms: six figures, competitive, DOE
  • Smart Normalization: Converts all amounts to annual USD for easy comparison
  • Confidence Scoring: Each extraction includes a confidence score
  • Validation & Sanity Checks: Flags unusual or problematic values
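
The individual amount formats listed above can be recognized with a small regular expression. The following is a minimal sketch of that idea using only the standard library. `AMOUNT_RE` and `parse_amount` are hypothetical names chosen for illustration; they are not part of pricetag's API and do not reflect its actual patterns.

```python
import re

# Hypothetical pattern: optional "$", digits with thousands separators or a
# decimal part, optional "k" suffix, optional trailing "USD".
AMOUNT_RE = re.compile(
    r"\$?\s*(?P<num>\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?)\s*(?P<k>[kK])?(?:\s*USD)?"
)


def parse_amount(text):
    """Return the first amount found in `text` as a float, or None."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    value = float(match.group("num").replace(",", ""))
    if match.group("k"):
        value *= 1000  # "50k" means 50 thousand
    return value


print(parse_amount("$50,000"))  # 50000.0
print(parse_amount("$50k"))     # 50000.0
print(parse_amount("50k USD"))  # 50000.0
```

A pattern this loose also matches bare numbers such as years of experience, which is one reason a real extractor needs context checks and confidence scores.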

Installation

pip install pricetag

Quick Start

from pricetag import PriceExtractor

# Initialize the extractor
extractor = PriceExtractor()

# Extract prices from text
text = "Senior Engineer position paying $120,000 - $150,000 annually"
results = extractor.extract(text)

# Access the results
for result in results:
    print(f"Value: {result['value']}")
    print(f"Type: {result['type']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Annual value: {result['normalized_annual']}")

Configuration Options

extractor = PriceExtractor(
    min_confidence=0.5,        # Minimum confidence score to include
    include_contextual=True,   # Extract terms like "six figures"
    normalize_to_annual=True,  # Convert to annual amounts
    min_salary=10000,          # Minimum reasonable salary
    max_salary=10000000,       # Maximum reasonable salary
    assume_hours_per_year=2080, # Hours used for hourly→annual conversion
    fast_mode=False,           # Set to True for faster bulk processing
    max_results_per_text=None  # Cap on results per text (None = no cap)
)

Examples

Basic Salary Extraction

text = "The position offers $75,000 per year with benefits"
results = extractor.extract(text)
# Returns: [{'value': 75000.0, 'type': 'annual', ...}]

Hourly Rate with Normalization

text = "Paying $45/hour for senior developers"
results = extractor.extract(text)
# Returns: [{'value': 45.0, 'type': 'hourly', 'normalized_annual': 93600.0, ...}]
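
The normalized figure in this example is the hourly rate multiplied by the configured hours per year: 45 × 2080 = 93,600. The sketch below reproduces that arithmetic. `normalize_to_annual` is a hypothetical helper, not a pricetag function, and the monthly branch (× 12) is an assumption about how monthly rates would be annualized.

```python
def normalize_to_annual(value, pay_type, hours_per_year=2080):
    """Convert an hourly, monthly, or annual amount to an annual figure."""
    if pay_type == "hourly":
        return value * hours_per_year
    if pay_type == "monthly":
        return value * 12  # assumption: twelve pay months per year
    if pay_type == "annual":
        return value
    raise ValueError(f"cannot normalize pay type: {pay_type!r}")


print(normalize_to_annual(45.0, "hourly"))     # 93600.0
print(normalize_to_annual(4000.0, "monthly"))  # 48000.0
```

The default of 2080 corresponds to 40 hours a week for 52 weeks; a different `assume_hours_per_year` changes the result proportionally.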

Salary Range

text = "Salary range: $60k-$80k depending on experience"
results = extractor.extract(text)
# Returns: [{'value': (60000.0, 80000.0), 'is_range': True, ...}]
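
To illustrate how a range such as the one above could be turned into a `(min, max)` tuple, here is a minimal standard-library sketch. `RANGE_RE`, `_to_float`, and `parse_range` are hypothetical names and the pattern is an assumption for illustration, not pricetag's implementation.

```python
import re

# Hypothetical pattern for "$60k-$80k" or "$60,000 to $80,000".
RANGE_RE = re.compile(
    r"\$?(?P<low>\d[\d,]*(?:\.\d+)?)(?P<low_k>[kK])?"
    r"\s*(?:-|to)\s*"
    r"\$?(?P<high>\d[\d,]*(?:\.\d+)?)(?P<high_k>[kK])?"
)


def _to_float(number, k_suffix):
    """Strip thousands separators and apply the optional "k" multiplier."""
    value = float(number.replace(",", ""))
    return value * 1000 if k_suffix else value


def parse_range(text):
    """Return (low, high) for the first range in `text`, or None."""
    match = RANGE_RE.search(text)
    if match is None:
        return None
    low = _to_float(match.group("low"), match.group("low_k"))
    high = _to_float(match.group("high"), match.group("high_k"))
    return (low, high)


print(parse_range("Salary range: $60k-$80k depending on experience"))
# (60000.0, 80000.0)
```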

Contextual Terms

text = "Looking for someone with 5+ years experience, six figures"
results = extractor.extract(text)
# Returns: [{'value': (100000, 999999), 'type': 'unknown', 'confidence': 0.7, ...}]
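
One way to picture contextual extraction is a lookup table from phrases to value ranges and flags. The sketch below is an assumption about the general technique, not pricetag's code: only the "six figures" range and the flag names come from this README, while `CONTEXTUAL_TERMS`, `find_contextual_terms`, and the `None` values are hypothetical.

```python
import re

# Hypothetical lookup table of contextual salary terms.
CONTEXTUAL_TERMS = {
    "six figures": {"value": (100000, 999999), "flags": []},
    "competitive": {"value": None, "flags": ["requires_market_data"]},
    "doe": {"value": None, "flags": ["experience_dependent"]},
}


def find_contextual_terms(text):
    """Return (term, info) pairs for each known term present in `text`."""
    found = []
    for term, info in CONTEXTUAL_TERMS.items():
        # Word boundaries keep "DOE" from matching inside "does".
        if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
            found.append((term, info))
    return found


hits = find_contextual_terms("Looking for someone with 5+ years experience, six figures")
print(hits[0][1]["value"])  # (100000, 999999)
```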

Multiple Prices

text = "Base: $100k, Bonus: up to $30k, Equity: 0.5%"
results = extractor.extract(text)
# Returns multiple results with appropriate flags

Batch Processing

texts = [
    "Salary: $50,000",
    "Rate: $30/hour",
    "Competitive pay"
]
results_batch = extractor.extract_batch(texts)

Output Format

Each extracted result is a PriceResult dictionary:

{
    'value': float | tuple[float, float],  # Single value or (min, max)
    'raw_text': str,                       # Original matched text
    'position': tuple[int, int],           # Character positions
    'type': str,                           # 'hourly', 'annual', 'monthly', etc.
    'confidence': float,                   # 0.0 to 1.0
    'normalized_annual': float | tuple,    # Annual USD amount
    'currency': str,                       # Always 'USD' in v1
    'is_range': bool,                      # True for ranges
    'flags': list[str]                     # Validation flags
}
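
For code that consumes these dictionaries, the documented fields can be mirrored in a `typing.TypedDict` so a type checker can catch misspelled keys. This is a sketch under assumptions: `PriceResultSketch` is a hypothetical name (the library may already export its own type), the sample values are illustrative rather than real library output, and `position` is assumed to be `(start, end)` with an exclusive end.

```python
from typing import List, Tuple, TypedDict, Union

Amount = Union[float, Tuple[float, float]]


class PriceResultSketch(TypedDict):
    """Hypothetical typed mirror of the documented result fields."""
    value: Amount
    raw_text: str
    position: Tuple[int, int]
    type: str
    confidence: float
    normalized_annual: Amount
    currency: str
    is_range: bool
    flags: List[str]


text = "Paying $45/hour for senior developers"
example: PriceResultSketch = {
    "value": 45.0,
    "raw_text": "$45/hour",
    "position": (7, 15),   # assumed (start, end), end exclusive
    "type": "hourly",
    "confidence": 0.9,     # illustrative value, not real library output
    "normalized_annual": 93600.0,
    "currency": "USD",
    "is_range": False,
    "flags": [],
}

# Under the assumption above, slicing by position recovers the matched text.
start, end = example["position"]
matched = text[start:end]
print(matched)  # $45/hour
```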

Validation Flags

  • invalid_range: Range maximum is less than its minimum
  • below_minimum: Below the configured minimum threshold
  • above_maximum: Above the configured maximum threshold
  • unreasonable_hourly_rate: Outside $7-$500/hour
  • potential_inconsistency: Large discrepancy with other prices
  • ambiguous_type: Unclear if hourly/annual
  • approximate: Amount qualified by a term such as "approximately"
  • requires_market_data: Needs external data (e.g., "competitive")
  • experience_dependent: Depends on experience (e.g., "DOE")
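
Several of the flags above are simple threshold checks. The sketch below shows how such checks could be written; it is not pricetag's implementation. `validation_flags` is a hypothetical helper, the thresholds are the defaults quoted in this README, and comparing hourly and monthly amounts to the salary limits on an annualized basis is an assumption.

```python
def validation_flags(value, pay_type, min_salary=10000, max_salary=10000000,
                     hours_per_year=2080):
    """Return flag names for a single value or a (min, max) range."""
    flags = []
    low, high = value if isinstance(value, tuple) else (value, value)
    if high < low:
        flags.append("invalid_range")
        low, high = high, low  # keep checking with the bounds in order
    if pay_type == "hourly" and (low < 7 or high > 500):
        flags.append("unreasonable_hourly_rate")
    # Assumption: salary limits apply to the annualized amount.
    per_year = {"hourly": hours_per_year, "monthly": 12}.get(pay_type, 1)
    if low * per_year < min_salary:
        flags.append("below_minimum")
    if high * per_year > max_salary:
        flags.append("above_maximum")
    return flags


print(validation_flags((80000.0, 60000.0), "annual"))  # ['invalid_range']
print(validation_flags(5.0, "hourly"))  # ['unreasonable_hourly_rate']
```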

Performance

The library is optimized for high-volume processing:

  • Pre-compiled regex patterns
  • Number parsing cache
  • Quick pre-filtering
  • Fast mode for bulk processing
  • Batch processing support

# Fast mode for high-volume processing
extractor = PriceExtractor(fast_mode=True, max_results_per_text=5)

# Process 1000 documents
texts = ["..." for _ in range(1000)]
results = extractor.extract_batch(texts)  # < 1 second
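
The first three optimizations in the list above are general techniques that can be shown in a few lines. The sketch below is an illustration of those techniques only, not pricetag's internals; `DOLLAR_AMOUNT_RE`, `parse_number`, and `extract_dollar_amounts` are hypothetical names, and the cache size is arbitrary.

```python
import re
from functools import lru_cache

# Pre-compiled pattern: built once at import time, not on every call.
DOLLAR_AMOUNT_RE = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)")


@lru_cache(maxsize=4096)
def parse_number(raw):
    """Parse a digit string such as "50,000"; repeated inputs hit the cache."""
    return float(raw.replace(",", ""))


def extract_dollar_amounts(text):
    """Return every "$"-prefixed amount in `text` as a float."""
    # Quick pre-filter: skip the regex entirely when no "$" is present.
    if "$" not in text:
        return []
    return [parse_number(raw) for raw in DOLLAR_AMOUNT_RE.findall(text)]


print(extract_dollar_amounts("Salary: $50,000"))  # [50000.0]
print(extract_dollar_amounts("Competitive pay"))  # []
```

The pre-filter pays off when most documents contain no price at all, and the cache helps because job postings tend to repeat a small set of round figures.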

Testing

Run the test suite:

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run with coverage
pytest tests/ --cov=pricetag

Limitations

  • Currently supports USD only
  • Optimized for US salary formats
  • Context window limited to surrounding text
  • Does not handle equity/stock compensation

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details

Author

Michelle Pellon - mgracepellon@gmail.com

Acknowledgments

Built with pure Python for maximum compatibility and zero dependencies.
