Pricetag
A pure-Python library for extracting price and currency information from unstructured text, with a primary focus on salary data in job postings.
Features
- Zero Dependencies: Uses only Python standard library
- High Performance: Processes 1000+ documents per second
- Comprehensive Pattern Support:
  - Individual amounts: $50,000, $50k, 50k USD
  - Ranges: $50k-$75k, $50,000 to $75,000
  - Hourly/Annual/Monthly rates: $25/hour, $50k/year, $4k/month
  - Contextual terms: six figures, competitive, DOE
- Smart Normalization: Converts all amounts to annual USD for easy comparison
- Confidence Scoring: Each extraction includes a confidence score
- Validation & Sanity Checks: Flags unusual or problematic values
Installation
pip install pricetag
Quick Start
from pricetag import PriceExtractor
# Initialize the extractor
extractor = PriceExtractor()
# Extract prices from text
text = "Senior Engineer position paying $120,000 - $150,000 annually"
results = extractor.extract(text)
# Access the results
for result in results:
    print(f"Value: {result['value']}")
    print(f"Type: {result['type']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Annual value: {result['normalized_annual']}")
Configuration Options
extractor = PriceExtractor(
    min_confidence=0.5,          # Minimum confidence score to include
    include_contextual=True,     # Extract terms like "six figures"
    normalize_to_annual=True,    # Convert to annual amounts
    min_salary=10000,            # Minimum reasonable salary
    max_salary=10000000,         # Maximum reasonable salary
    assume_hours_per_year=2080,  # For hourly→annual conversion
    fast_mode=False,             # Enable for better performance
    max_results_per_text=None    # Limit number of results
)
Examples
Basic Salary Extraction
text = "The position offers $75,000 per year with benefits"
results = extractor.extract(text)
# Returns: [{'value': 75000.0, 'type': 'annual', ...}]
Hourly Rate with Normalization
text = "Paying $45/hour for senior developers"
results = extractor.extract(text)
# Returns: [{'value': 45.0, 'type': 'hourly', 'normalized_annual': 93600.0, ...}]
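The annual figure in the hourly example is consistent with multiplying the rate by the default assume_hours_per_year of 2080 (45 × 2080 = 93,600). The arithmetic can be sketched as below. The helper name to_annual and the monthly ×12 rule are assumptions made for illustration; they are not part of pricetag's API.

```python
# Hypothetical helper for illustration; not pricetag's implementation.
HOURS_PER_YEAR = 2080   # mirrors the assume_hours_per_year default
MONTHS_PER_YEAR = 12    # assumed rule for monthly amounts


def to_annual(value: float, pay_type: str,
              hours_per_year: int = HOURS_PER_YEAR) -> float:
    """Convert an hourly, monthly, or annual amount to an annual figure."""
    if pay_type == "hourly":
        return value * hours_per_year
    if pay_type == "monthly":
        return value * MONTHS_PER_YEAR
    if pay_type == "annual":
        return value
    raise ValueError(f"unsupported pay type: {pay_type!r}")


print(to_annual(45.0, "hourly"))     # 93600.0
print(to_annual(4000.0, "monthly"))  # 48000.0
```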
Salary Range
text = "Salary range: $60k-$80k depending on experience"
results = extractor.extract(text)
# Returns: [{'value': (60000.0, 80000.0), 'is_range': True, ...}]
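To give a feel for how a range such as $60k-$80k can be turned into a (min, max) tuple, here is a minimal standalone sketch using the standard re module. The pattern and the function name parse_range are assumptions for illustration only; the patterns pricetag actually uses may differ.

```python
import re
from typing import Optional, Tuple

# Illustrative pattern only: a dollar amount with optional thousands
# separators, optional decimals, and an optional "k" suffix.
AMOUNT = r"\$(\d+(?:,\d{3})*(?:\.\d+)?)(k\b)?"
RANGE_RE = re.compile(AMOUNT + r"\s*(?:-|to)\s*" + AMOUNT, re.IGNORECASE)


def _to_float(digits: str, k_suffix: Optional[str]) -> float:
    """Strip commas and apply the x1000 multiplier for a 'k' suffix."""
    value = float(digits.replace(",", ""))
    return value * 1000 if k_suffix else value


def parse_range(text: str) -> Optional[Tuple[float, float]]:
    """Return the first (low, high) range found in text, or None."""
    match = RANGE_RE.search(text)
    if match is None:
        return None
    low = _to_float(match.group(1), match.group(2))
    high = _to_float(match.group(3), match.group(4))
    return (low, high)


print(parse_range("Salary range: $60k-$80k depending on experience"))
print(parse_range("Pays $50,000 to $75,000"))
```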
Contextual Terms
text = "Looking for someone with 5+ years experience, six figures"
results = extractor.extract(text)
# Returns: [{'value': (100000, 999999), 'type': 'unknown', 'confidence': 0.7, ...}]
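One way to picture contextual-term handling is a lookup table from phrase to an implied value and flags. The sketch below is an assumption about how such a table could look, not pricetag's internals: the six-figures range and the two flag names come from this README, while the None values for "competitive" and "DOE" and the function name find_contextual_terms are hypothetical.

```python
import re

# Hypothetical lookup table; the None values are an assumption.
CONTEXTUAL_TERMS = {
    "six figures": {"value": (100000, 999999), "flags": []},
    "competitive": {"value": None, "flags": ["requires_market_data"]},
    "doe": {"value": None, "flags": ["experience_dependent"]},
}

# One pre-compiled, case-insensitive alternation over all known terms.
_TERM_RE = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in CONTEXTUAL_TERMS) + r")\b",
    re.IGNORECASE,
)


def find_contextual_terms(text):
    """Return one dict per contextual term found in text."""
    results = []
    for match in _TERM_RE.finditer(text):
        entry = CONTEXTUAL_TERMS[match.group(1).lower()]
        results.append({
            "raw_text": match.group(0),
            "position": match.span(),
            "value": entry["value"],
            "flags": list(entry["flags"]),
        })
    return results


for hit in find_contextual_terms("Competitive pay, DOE, likely six figures"):
    print(hit["raw_text"], hit["value"], hit["flags"])
```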
Multiple Prices
text = "Base: $100k, Bonus: up to $30k, Equity: 0.5%"
results = extractor.extract(text)
# Returns multiple results with appropriate flags
Batch Processing
texts = [
    "Salary: $50,000",
    "Rate: $30/hour",
    "Competitive pay"
]
results_batch = extractor.extract_batch(texts)
Output Format
Each extraction returns a PriceResult dictionary:
{
    'value': float | tuple[float, float],  # Single value or (min, max)
    'raw_text': str,                       # Original matched text
    'position': tuple[int, int],           # Character positions
    'type': str,                           # 'hourly', 'annual', 'monthly', etc.
    'confidence': float,                   # 0.0 to 1.0
    'normalized_annual': float | tuple,    # Annual USD amount
    'currency': str,                       # Always 'USD' in v1
    'is_range': bool,                      # True for ranges
    'flags': list[str]                     # Validation flags
}
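For type-checked code that consumes these dictionaries, the shape above can be mirrored with a typing.TypedDict. Whether pricetag itself exports such a type is not stated here, so the class below is a local, assumed definition. The helper annual_midpoint and the sample values (including the confidence) are made up for illustration.

```python
from typing import List, Tuple, TypedDict, Union


class PriceResult(TypedDict):
    """Local mirror of the documented result shape (assumed, not imported)."""
    value: Union[float, Tuple[float, float]]
    raw_text: str
    position: Tuple[int, int]
    type: str
    confidence: float
    normalized_annual: Union[float, Tuple[float, float]]
    currency: str
    is_range: bool
    flags: List[str]


def annual_midpoint(result: PriceResult) -> float:
    """Collapse a result to one annual number, using the midpoint for ranges."""
    annual = result["normalized_annual"]
    if isinstance(annual, tuple):
        low, high = annual
        return (low + high) / 2
    return annual


# Hand-written sample data, not actual library output.
sample: PriceResult = {
    "value": (60000.0, 80000.0),
    "raw_text": "$60k-$80k",
    "position": (14, 23),
    "type": "annual",
    "confidence": 0.9,
    "normalized_annual": (60000.0, 80000.0),
    "currency": "USD",
    "is_range": True,
    "flags": [],
}
print(annual_midpoint(sample))  # 70000.0
```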
Validation Flags
- invalid_range: Max less than min
- below_minimum: Below configured threshold
- above_maximum: Above configured threshold
- unreasonable_hourly_rate: Outside $7-$500/hour
- potential_inconsistency: Large discrepancy with other prices
- ambiguous_type: Unclear if hourly/annual
- approximate: Estimated from "approximately"
- requires_market_data: Needs external data (e.g., "competitive")
- experience_dependent: Depends on experience (e.g., "DOE")
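Four of these flags are simple threshold checks, which the sketch below reproduces under stated assumptions. The function name sanity_flags is hypothetical, and comparing the annualized amount (hourly × 2080, monthly × 12) against min_salary and max_salary is a guess at the intended behavior. The defaults mirror the configuration options listed earlier.

```python
from typing import List, Tuple, Union

Amount = Union[float, Tuple[float, float]]


def sanity_flags(value: Amount, pay_type: str,
                 min_salary: float = 10_000,
                 max_salary: float = 10_000_000,
                 hours_per_year: int = 2080) -> List[str]:
    """Return validation flags for a single amount or a (min, max) range."""
    flags: List[str] = []
    first, second = value if isinstance(value, tuple) else (value, value)
    if second < first:
        flags.append("invalid_range")
    low, high = min(first, second), max(first, second)

    if pay_type == "hourly":
        # Bounds taken from the flag description: $7-$500/hour.
        if low < 7 or high > 500:
            flags.append("unreasonable_hourly_rate")
        low, high = low * hours_per_year, high * hours_per_year
    elif pay_type == "monthly":
        low, high = low * 12, high * 12  # assumed monthly-to-annual rule

    # Assumption: thresholds apply to the annualized amount.
    if low < min_salary:
        flags.append("below_minimum")
    if high > max_salary:
        flags.append("above_maximum")
    return flags


print(sanity_flags(45.0, "hourly"))                # []
print(sanity_flags((80000.0, 60000.0), "annual"))  # ['invalid_range']
print(sanity_flags(5000.0, "annual"))              # ['below_minimum']
```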
Performance
The library is optimized for high-volume processing:
- Pre-compiled regex patterns
- Number parsing cache
- Quick pre-filtering
- Fast mode for bulk processing
- Batch processing support
# Fast mode for high-volume processing
extractor = PriceExtractor(fast_mode=True, max_results_per_text=5)
# Process 1000 documents
texts = ["..." for _ in range(1000)]
results = extractor.extract_batch(texts) # < 1 second
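Three of the techniques listed above (pre-compiled patterns, a number parsing cache, and quick pre-filtering) are generic and can be shown with the standard library alone. The sketch below illustrates the ideas and is not pricetag's code. The names PRICE_RE, parse_amount, and extract_amounts, the pattern, and the cache size are all assumptions.

```python
import re
from functools import lru_cache
from typing import List

# Compiled once at import time rather than on every call.
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:k\b)?", re.IGNORECASE)


@lru_cache(maxsize=4096)
def parse_amount(token: str) -> float:
    """Parse a token such as '$50,000' or '$50k'; repeated tokens hit the cache."""
    body = token.lstrip("$").replace(",", "")
    if body[-1:] in ("k", "K"):
        return float(body[:-1]) * 1000
    return float(body)


def extract_amounts(text: str) -> List[float]:
    """Return all dollar amounts in text, skipping the regex when possible."""
    if "$" not in text:  # quick pre-filter: cheap substring test first
        return []
    return [parse_amount(token) for token in PRICE_RE.findall(text)]


print(extract_amounts("Base: $100k, Bonus: up to $30k, Equity: 0.5%"))
print(extract_amounts("Competitive pay"))
```

The substring test lets documents with no dollar sign return without running the regex at all, which matters most when many inputs contain no price.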
Testing
Run the test suite:
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Run with coverage
pytest tests/ --cov=pricetag
Limitations
- Currently supports USD only
- Optimized for US salary formats
- Context window limited to surrounding text
- Does not handle equity/stock compensation
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details
Author
Michelle Pellon - mgracepellon@gmail.com
Acknowledgments
Built with pure Python for maximum compatibility and zero dependencies.
File details
Details for the file pricetag-1.0.0.tar.gz.
File metadata
- Download URL: pricetag-1.0.0.tar.gz
- Upload date:
- Size: 27.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 62119312c63a2678d9b1c65c819fca7b620111b7a98757d120b83f7d82d51346 |
| MD5 | d7fbcb8bd51e1823ec7b6e5a0a0f3e81 |
| BLAKE2b-256 | a3dfde79fb7001169aeddd859290db8483ad5d2ce918e04f96713045386a008d |
File details
Details for the file pricetag-1.0.0-py3-none-any.whl.
File metadata
- Download URL: pricetag-1.0.0-py3-none-any.whl
- Upload date:
- Size: 19.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ce18a9cfdf74492ef04273f043770516cd6c93acb1c4dc4b23854c490693273f |
| MD5 | da8346498daa0b029c2e69480349e543 |
| BLAKE2b-256 | e00eef55efbdd8e3c87e6cdf4f5c25138da9ccff46991257aec988bb8f65c12b |