Search Names: Advanced Name Search and Entity Recognition
Search Names is a modern Python package for advanced name search and entity recognition in large text corpora. It uses sophisticated NLP techniques to handle the complexities of real-world name matching with high accuracy and performance.
Key Challenges Addressed
When searching for names in large text corpora, you encounter seven major challenges:
- Non-standard formats - Names may appear as "LastName, FirstName" or "FirstName MiddleName LastName"
- Reference variations - People may be referred to as "President Clinton" vs "William Clinton"
- Misspellings - OCR errors and typos in source documents
- Name variants - Nicknames (Bill vs William), diminutives, and middle names
- False positives - Common names that overlap with other famous people
- Duplicates - Multiple entries for the same person in different formats
- Computational efficiency - Fast processing of large-scale datasets
Our package addresses each of these systematically with modern NLP and AI techniques; the short sketch below makes the name-variant challenge concrete.
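Here is a minimal sketch of why plain substring matching needs variant expansion. This is not the package's internal code, and the nickname table is hypothetical:

# Minimal sketch: exact matching misses common nickname variants.
# The nickname table here is hypothetical, not the package's data.
nicknames = {"william": {"bill", "will", "billy"}}

def variants(first, last):
    """Yield lowercase 'first last' patterns, including nickname forms."""
    yield f"{first.lower()} {last.lower()}"
    for nick in nicknames.get(first.lower(), ()):
        yield f"{nick} {last.lower()}"

text = "bill clinton spoke at the convention yesterday"
print(any(v in text for v in variants("William", "Clinton")))  # True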
🚀 Quick Start
Installation
# Core installation (name parsing, fuzzy search, CLI)
pip install search_names
# Enhanced features (pandas, spaCy, semantic similarity)
pip install "search_names[enhanced]"
# Advanced NLP (transformers, torch)
pip install "search_names[nlp]"
# All features including ML, formats, search backends, web
pip install "search_names[all]"
Modern CLI Interface
# Create sample configuration
search-names config create-sample
# Run complete pipeline
search-names pipeline input_names.csv text_corpus.csv
# Individual pipeline steps
search-names clean input_names.csv --streaming # Step 1: Clean names
search-names merge-supp clean_names.csv # Step 2: Augment data
search-names preprocess augmented_names.csv # Step 3: Preprocess
search-names search text_corpus.csv --optimized --streaming # Step 4: Search
# Performance options
search-names search corpus.csv --optimized # Use optimized search engine
search-names search large_corpus.csv --streaming # Memory-efficient streaming
search-names clean huge_names.csv --streaming # Chunked processing
Python API
import search_names
from search_names.pipeline import clean_names, augment_names, preprocess_names, search_names
from search_names.pipeline.step4_search import load_names_file
# Load configuration
config = search_names.get_config()
# Step 1: Clean and standardize names
result = clean_names(
    infile="input_names.csv",
    outfile="clean_names.csv",
    col="Name",
    all=False  # Remove duplicates
)
# Step 2: Augment with supplementary data
augment_names(
    infile="clean_names.csv",
    prefixarg="seat",    # Column for prefix lookup
    name="FirstName",    # Column for nickname lookup
    outfile="augmented.csv",
    prefix_file="prefixes.csv",
    nickname_file="nick_names.txt"
)
# Step 3: Create optimized search patterns
preprocess_names(
    infile="augmented.csv",
    patterns=["FirstName LastName", "NickName LastName"],
    outfile="preprocessed.csv",
    editlength=[10, 15],             # Fuzzy matching lengths
    drop_patterns=["common", "the"]  # Patterns to exclude
)
# Step 4: High-performance search with optimization
names = load_names_file("preprocessed.csv")
result = search_names(
    input="text_corpus.csv",
    text="text",            # Text column name
    names=names,            # Preprocessed name list
    outfile="results.csv",
    use_optimized=True,     # Use vectorized search engine
    use_streaming=True,     # Memory-efficient for large files
    processes=8,            # Parallel processing
    max_name=20,            # Max results per document
    clean=True              # Clean text before search
)
📋 The Workflow
Our package implements a systematic 4-step pipeline:
1. Clean Names (search-names clean)
Standardize names from various formats into structured components:
- Extract FirstName, MiddleName, LastName, Prefix, Suffix
- Handle titles, Roman numerals, and compound names
- Remove duplicates and normalize formatting
Input: Raw names in various formats. Output: Structured name components.
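As a rough illustration (the exact output schema may differ), step 1 turns a raw string into structured components like these:

# Illustration only: the kind of transformation step 1 performs.
raw = "Sen. Clinton, William J., III"
cleaned = {
    "Prefix": "Sen.",
    "FirstName": "William",
    "MiddleName": "J.",
    "LastName": "Clinton",
    "Suffix": "III",
}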
2. Merge Supplementary Data (search-names merge-supp)
Enrich names with additional variations:
- Prefixes: Context-specific titles (Senator, Dr., etc.)
- Nicknames: Common diminutives and alternatives
- Aliases: Alternative names and spellings
Input: Cleaned names + lookup files. Output: Augmented names with variations.
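The layouts of prefixes.csv and nick_names.txt are documented with the package; as a hypothetical illustration of the merge step's effect, each name row gains one search variant per prefix and nickname found in the lookups:

# Hypothetical illustration of the augmentation step (file formats assumed).
row = {"FirstName": "William", "LastName": "Clinton", "seat": "senate"}
nick_lookup = {"William": ["Bill", "Will"]}           # from nick_names.txt
prefix_lookup = {"senate": ["Senator", "Sen."]}       # from prefixes.csv

row["NickNames"] = nick_lookup.get(row["FirstName"], [])
row["Prefixes"] = prefix_lookup.get(row["seat"], [])
print(row)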
3. Preprocess (search-names preprocess)
Prepare optimized search patterns:
- Convert wide format to long format (one pattern per row)
- Deduplicate ambiguous patterns
- Filter out problematic patterns
- Generate fuzzy matching parameters (illustrated in the sketch below)
Input: Augmented names. Output: Optimized search patterns.
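The fuzzy matching parameters pair a minimum pattern length with an allowed edit distance, as in the fuzzy_min_lengths setting shown in the configuration section below. Assuming each pair means (minimum length, allowed edits), the policy works like this:

def allowed_edits(pattern, thresholds=((10, 1), (15, 2))):
    """Edit distance allowed for a pattern, given (min_length, edits) pairs.

    Under this reading of fuzzy_min_lengths, patterns of 10+ characters
    tolerate 1 edit, 15+ characters tolerate 2, and shorter patterns
    must match exactly.
    """
    edits = 0
    for min_len, max_edits in thresholds:
        if len(pattern) >= min_len:
            edits = max_edits
    return edits

print(allowed_edits("Bill Clinton"))        # 12 chars -> 1 edit
print(allowed_edits("William J. Clinton"))  # 18 chars -> 2 edits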
4. Search (search-names search)
Execute high-performance name search:
- Multi-threaded parallel processing with optimized chunking
- High-performance mode for large files (2-5x faster)
- Fuzzy matching with edit distance
- Streaming mode for memory efficiency
- Context-aware filtering and confidence scoring
Input: Text corpus + search patterns. Output: Ranked search results with confidence scores.
🔧 Modern Features
Advanced NLP Integration
- spaCy NER: Context-aware person detection (see the sketch after this list)
- Transformers: Semantic similarity matching
- Entity Linking: Connect mentions to knowledge bases
- Confidence Scoring: Quantify match uncertainty
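To see what context-aware person detection buys you, here is a standalone spaCy sketch of the general technique (not the package's exact internals): only spans tagged PERSON are kept as candidates, which suppresses false positives from places and organizations.

import spacy

nlp = spacy.load("en_core_web_sm")  # install via the [enhanced] extra plus the model

doc = nlp("President Clinton visited Clinton, Iowa last week.")
persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(persons)  # the town is typically tagged GPE, not PERSON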
Performance & Scalability
- Optimized Search Engine: Vectorized string matching with NumPy
- Streaming Processing: Handle datasets larger than memory with chunking (sketched below)
- Parallel Search: Multi-process search with configurable worker count
- Memory Management: Automatic streaming for large files (>500MB)
- Regex Optimization: Pre-compiled patterns and single-pass matching
- Early Termination: Stop processing when result limits reached
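The streaming behavior follows the familiar chunked-reader pattern. Here is a minimal pandas sketch of the idea; the package wraps this behind --streaming and use_streaming=True, so you normally don't write it yourself (the "text" column and 1000-row chunks mirror the examples and defaults elsewhere in this README):

import pandas as pd

# Process a corpus too large for memory in fixed-size chunks.
first = True
for chunk in pd.read_csv("large_corpus.csv", chunksize=1000):
    hits = chunk[chunk["text"].str.contains("Clinton", na=False)]
    hits.to_csv("results.csv", mode="w" if first else "a",
                header=first, index=False)
    first = False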
Developer Experience
- Type Hints: Full type annotation support
- Rich Logging: Beautiful, structured logging
- Configuration: YAML-based configuration management
- Modern CLI: Interactive command-line interface with progress bars
📁 File Format Support
| Format | Input | Output | Description |
|---|---|---|---|
| CSV | ✅ | ✅ | Primary format with full pipeline support |
| Compressed CSV | ✅ | ✅ | Gzip-compressed CSV files (.csv.gz) |
| Text Files | ✅ | ❌ | Nickname/pattern lookup files (.txt) |
Note: Focusing on the CSV format ensures maximum compatibility and performance. Optional formats such as Parquet and JSON can be added via the [all] installation option if needed.
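Compressed CSV is straightforward to illustrate with pandas, which infers gzip compression from the .csv.gz extension (a generic sketch, not the package's own reader):

import pandas as pd

df = pd.read_csv("text_corpus.csv.gz")    # compression inferred from the extension
df.to_csv("results.csv.gz", index=False)  # output is gzip-compressed the same way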
⚡ Performance Optimizations
The package includes several performance optimizations for handling large-scale data:
Automatic Streaming
# Files >500MB automatically use streaming
search-names search large_corpus.csv --optimized
# Force streaming for any file size
search-names clean names.csv --streaming
search-names search corpus.csv --streaming
Optimized Search Engine
from search_names.optimized_searchengines import create_optimized_search_engine
# Up to 10x faster than original implementation
engine = create_optimized_search_engine(names, use_streaming=True)
results = engine.search_file_streaming("large_corpus.csv", "results.csv")
Memory Management
- Chunked Processing: Processes files in 1000-row chunks by default
- Progress Tracking: Real-time progress reporting for long operations
- Memory Estimation: Automatic file size analysis to choose the processing method (sketched below)
- Resource Cleanup: Proper cleanup of temporary files and memory
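The memory estimation above amounts to checking the input size against the configured threshold; a minimal sketch of the auto-detection logic (the real implementation may differ in detail):

import os

def should_stream(path, max_memory_mb=500):
    """Choose streaming when the input exceeds the memory threshold
    (mirrors performance.max_memory_mb in the configuration below)."""
    return os.path.getsize(path) > max_memory_mb * 1024 * 1024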
Benchmarking
from search_names.optimized_searchengines import benchmark_search_engines
# Compare performance between engines
speedup = benchmark_search_engines(keywords, test_text, iterations=100)
print(f"Optimized engine is {speedup:.1f}x faster")
⚙️ Configuration
Create a configuration file to customize behavior:
search-names config create-sample
Example search_names.yaml:
# Search behavior
search:
  max_results: 20
  fuzzy_min_lengths: [[10, 1], [15, 2]]
  processes: 4
  use_optimized: true   # Use optimized search engine
  use_streaming: false  # Auto-detect large files

# Performance settings
performance:
  chunk_size: 1000      # Rows per chunk for streaming
  max_memory_mb: 500    # Memory threshold for streaming
  enable_benchmarking: false

# NLP features
nlp:
  use_spacy: true
  spacy_model: "en_core_web_sm"
  similarity_threshold: 0.8

# Text processing
text_processing:
  remove_stopwords: true
  normalize_unicode: true
  streaming_for_large_files: true
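Because the configuration is plain YAML, you can also inspect it programmatically. search_names.get_config() (shown in the Quick Start) returns the parsed settings; a hand-rolled equivalent with PyYAML is simply:

import yaml

with open("search_names.yaml") as f:
    config = yaml.safe_load(f)

print(config["search"]["max_results"])      # 20
print(config["performance"]["chunk_size"])  # 1000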
🔄 Legacy CLI Support
All original commands remain available for backward compatibility:
clean_names input.csv
merge_supp cleaned.csv
preprocess augmented.csv
split_text_corpus large_corpus.csv
search_names corpus.csv
merge_results chunk_*.csv
📊 Examples
Basic Name Cleaning
from search_names.pipeline import clean_names
# Clean messy names
result = clean_names(
    infile="politicians.csv",
    outfile="clean_politicians.csv",
    col="Name",
    all=False  # Remove duplicates
)
print(f"Processed {len(result)} names")
Advanced Search with Optimizations
from search_names.pipeline import search_names
from search_names.pipeline.step4_search import load_names_file
from search_names.optimized_searchengines import benchmark_search_engines
# Load preprocessed names
names = load_names_file("politician_names.csv")
# Benchmark performance (optional)
test_text = "President Biden met with Senator Warren yesterday"
speedup = benchmark_search_engines(names, test_text, iterations=10)
print(f"Optimized engine is {speedup:.1f}x faster")
# Run optimized search
results = search_names(
    input="news_articles.csv",
    text="article_text",
    names=names,
    outfile="search_results.csv",
    max_name=50,
    processes=8,
    editlength=[8, 12],   # Fuzzy matching thresholds
    use_optimized=True,   # Use vectorized engine
    use_streaming=True,   # Stream large files
    clean=True            # Clean text preprocessing
)
🎯 Use Cases
- Academic Research: Find mentions of historical figures in digitized texts
- Journalism: Track politician mentions across news coverage
- Legal Discovery: Locate person references in legal documents
- Genealogy: Search family names across historical records
- Business Intelligence: Monitor executive mentions in financial reports
📈 Performance
- Speed: Scales to corpora of 1M+ documents on modern hardware
- Memory: Streaming keeps memory use bounded, so dataset size is limited by disk rather than RAM
- Accuracy: 95%+ precision with proper configuration
- Scalability: Near-linear scaling across CPU cores
🤝 Contributing
We welcome contributions! Please see our Code of Conduct.
📄 License
This project is licensed under the MIT License.
👥 Authors
- Suriyan Laohaprapanon - Original creator
- Gaurav Sood - Co-creator and maintainer
For detailed documentation and advanced usage, visit our documentation site.