
Advanced name search and entity recognition in large text corpora with NLP and AI capabilities

Project description

Search Names: Advanced Name Search and Entity Recognition


Search Names is a modern Python package for advanced name search and entity recognition in large text corpora. It uses sophisticated NLP techniques to handle the complexities of real-world name matching with high accuracy and performance.

Key Challenges Addressed

When searching for names in large text corpora, you encounter seven major challenges:

  1. Non-standard formats - Names may appear as "LastName, FirstName" or "FirstName MiddleName LastName"
  2. Reference variations - People may be referred to as "President Clinton" vs "William Clinton"
  3. Misspellings - OCR errors and typos in source documents (see the edit-distance sketch below)
  4. Name variants - Nicknames (Bill vs William), diminutives, and middle names
  5. False positives - Common names that overlap with other famous people
  6. Duplicates - Multiple entries for the same person in different formats
  7. Computational efficiency - Fast processing of large-scale datasets

Our package addresses each of these systematically with modern NLP and AI techniques.
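
Of these, misspellings (challenge 3) are the one a plain substring search can never catch; the standard remedy is fuzzy matching by edit distance. Here is a minimal, package-independent sketch of the idea using Python's standard library (difflib stands in for the package's own matcher):

import difflib

# "Willaim Clinton" is an OCR-style transposition of "William Clinton".
candidates = ["William Clinton", "Hillary Clinton", "George Bush"]
query = "Willaim Clinton"

# get_close_matches ranks candidates by similarity; the cutoff plays the
# same role as the package's edit-distance thresholds.
print(difflib.get_close_matches(query, candidates, n=1, cutoff=0.8))
# -> ['William Clinton']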

🚀 Quick Start

Installation

# Core installation (name parsing, fuzzy search, CLI)
pip install search_names

# Enhanced features (pandas, spaCy, semantic similarity)
pip install "search_names[enhanced]"

# Advanced NLP (transformers, torch)
pip install "search_names[nlp]"

# All features including ML, formats, search backends, web
pip install "search_names[all]"

Modern CLI Interface

# Create sample configuration
search-names config create-sample

# Run complete pipeline
search-names pipeline input_names.csv text_corpus.csv

# Individual pipeline steps
search-names clean input_names.csv --streaming           # Step 1: Clean names
search-names merge-supp clean_names.csv                  # Step 2: Augment data
search-names preprocess augmented_names.csv              # Step 3: Preprocess
search-names search text_corpus.csv --optimized --streaming  # Step 4: Search

# Performance options
search-names search corpus.csv --optimized               # Use optimized search engine
search-names search large_corpus.csv --streaming         # Memory-efficient streaming
search-names clean huge_names.csv --streaming            # Chunked processing

Python API

import search_names
from search_names.pipeline import clean_names, augment_names, preprocess_names, search_names
from search_names.pipeline.step4_search import load_names_file

# Load configuration
config = search_names.get_config()

# Step 1: Clean and standardize names
result = clean_names(
    infile="input_names.csv",
    outfile="clean_names.csv",
    col="Name",
    all=False  # Remove duplicates
)

# Step 2: Augment with supplementary data
augment_names(
    infile="clean_names.csv",
    prefixarg="seat",           # Column for prefix lookup
    name="FirstName",           # Column for nickname lookup
    outfile="augmented.csv",
    prefix_file="prefixes.csv",
    nickname_file="nick_names.txt"
)

# Step 3: Create optimized search patterns
preprocess_names(
    infile="augmented.csv",
    patterns=["FirstName LastName", "NickName LastName"],
    outfile="preprocessed.csv",
    editlength=[10, 15],        # Fuzzy matching lengths
    drop_patterns=["common", "the"]  # Patterns to exclude
)

# Step 4: High-performance search with optimization
names = load_names_file("preprocessed.csv")
result = search_names(
    input="text_corpus.csv",
    text="text",               # Text column name
    names=names,               # Preprocessed name list
    outfile="results.csv",
    use_optimized=True,        # Use vectorized search engine
    use_streaming=True,        # Memory-efficient for large files
    processes=8,               # Parallel processing
    max_name=20,               # Max results per document
    clean=True                 # Clean text before search
)

📋 The Workflow

Our package implements a systematic 4-step pipeline:

1. Clean Names (search-names clean)

Standardize names from various formats into structured components:

  • Extract FirstName, MiddleName, LastName, Prefix, Suffix
  • Handle titles, Roman numerals, and compound names
  • Remove duplicates and normalize formatting

Input: Raw names in various formats
Output: Structured name components
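
For example, "Clinton, William J." and "William J. Clinton" should both reduce to the same structured record: FirstName=William, MiddleName=J., LastName=Clinton.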

2. Merge Supplementary Data (search-names merge-supp)

Enrich names with additional variations:

  • Prefixes: Context-specific titles (Senator, Dr., etc.)
  • Nicknames: Common diminutives and alternatives
  • Aliases: Alternative names and spellings

Input: Cleaned names + lookup files
Output: Augmented names with variations
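
The lookup files are simple flat files. As a purely hypothetical illustration of the idea (the exact layout expected by merge-supp may differ), a nickname file maps a canonical first name to its diminutives:

# nick_names.txt (hypothetical layout)
William,Bill,Billy,Will
Elizabeth,Liz,Beth,Betsy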

3. Preprocess (search-names preprocess)

Prepare optimized search patterns:

  • Convert wide format to long format (one pattern per row; see the sketch below)
  • Deduplicate ambiguous patterns
  • Filter out problematic patterns
  • Generate fuzzy matching parameters

Input: Augmented names
Output: Optimized search patterns
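
The wide-to-long conversion mentioned above is a standard reshaping step; a minimal sketch with pandas (illustrative only, not the package's internal code):

import pandas as pd

# Wide format: one row per person, one column per name pattern...
wide = pd.DataFrame({
    "id": [1],
    "FirstName LastName": ["William Clinton"],
    "NickName LastName": ["Bill Clinton"],
})

# ...becomes long format: one search pattern per row.
long_df = wide.melt(id_vars="id", var_name="pattern", value_name="search_string")
print(long_df)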

4. Search (search-names search)

Execute high-performance name search:

  • Multi-threaded parallel processing with optimized chunking
  • High-performance mode for large files (2-5x faster)
  • Fuzzy matching with edit distance
  • Streaming mode for memory efficiency
  • Context-aware filtering and confidence scoring

Input: Text corpus + search patterns
Output: Ranked search results with confidence scores

🔧 Modern Features

Advanced NLP Integration

  • spaCy NER: Context-aware person detection (see the sketch below)
  • Transformers: Semantic similarity matching
  • Entity Linking: Connect mentions to knowledge bases
  • Confidence Scoring: Quantify match uncertainty
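
As a sketch of what the spaCy integration buys you, context-aware person detection means candidate matches can be restricted to spans that spaCy's NER tags as PERSON (this uses spaCy's public API directly, not the package's wrapper):

import spacy

# Requires the model once: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("President Clinton met Bill Gates in Washington.")

# Keep only PERSON entities; "Washington" (a place) is filtered out.
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(people)  # e.g. ['Clinton', 'Bill Gates']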

Performance & Scalability

  • Optimized Search Engine: Vectorized string matching with NumPy
  • Streaming Processing: Handle datasets larger than memory with chunking
  • Parallel Search: Multi-process search with configurable worker count
  • Memory Management: Automatic streaming for large files (>500MB)
  • Regex Optimization: Pre-compiled patterns and single-pass matching
  • Early Termination: Stop processing when result limits reached

Developer Experience

  • Type Hints: Full type annotation support
  • Rich Logging: Beautiful, structured logging
  • Configuration: YAML-based configuration management
  • Modern CLI: Interactive command-line interface with progress bars

📁 File Format Support

  • CSV: primary format with full pipeline support (input and output)
  • Compressed CSV: gzip-compressed CSV files (.csv.gz)
  • Text files: nickname/pattern lookup files (.txt)

Note: Focus on CSV format ensures maximum compatibility and performance. Optional formats like Parquet and JSON can be added via the [all] installation option if needed.
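
Compressed CSVs need no special handling downstream. With pandas, for instance, gzip compression is inferred from the file extension:

import pandas as pd

# Compression is inferred from the .gz suffix; no extra flag needed.
df = pd.read_csv("text_corpus.csv.gz")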

⚡ Performance Optimizations

The package includes several performance optimizations for handling large-scale data:

Automatic Streaming

# Files >500MB automatically use streaming
search-names search large_corpus.csv --optimized

# Force streaming for any file size
search-names clean names.csv --streaming
search-names search corpus.csv --streaming

Optimized Search Engine

from search_names.optimized_searchengines import create_optimized_search_engine

# Up to 10x faster than the original implementation
engine = create_optimized_search_engine(names, use_streaming=True)
results = engine.search_file_streaming("large_corpus.csv", "results.csv")

Memory Management

  • Chunked Processing: Processes files in 1000-row chunks by default (see the sketch below)
  • Progress Tracking: Real-time progress reporting for long operations
  • Memory Estimation: Automatic file size analysis to choose processing method
  • Resource Cleanup: Proper cleanup of temporary files and memory
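
The chunked-processing pattern itself is plain pandas; a minimal sketch (independent of the package's internals) of scanning a large CSV 1000 rows at a time:

import pandas as pd

# Stream the corpus in 1000-row chunks so memory use stays flat
# regardless of file size (header handling omitted for brevity).
for chunk in pd.read_csv("large_corpus.csv", chunksize=1000):
    hits = chunk[chunk["text"].str.contains("Clinton", na=False)]
    hits.to_csv("results.csv", mode="a", header=False, index=False)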

Benchmarking

from search_names.optimized_searchengines import benchmark_search_engines

# Compare performance between engines
speedup = benchmark_search_engines(keywords, test_text, iterations=100)
print(f"Optimized engine is {speedup:.1f}x faster")

⚙️ Configuration

Create a configuration file to customize behavior:

search-names config create-sample

Example search_names.yaml:

# Search behavior
search:
  max_results: 20
  fuzzy_min_lengths: [[10, 1], [15, 2]]
  processes: 4
  use_optimized: true      # Use optimized search engine
  use_streaming: false     # Auto-detect large files

# Performance settings
performance:
  chunk_size: 1000         # Rows per chunk for streaming
  max_memory_mb: 500       # Memory threshold for streaming
  enable_benchmarking: false

# NLP features
nlp:
  use_spacy: true
  spacy_model: "en_core_web_sm"
  similarity_threshold: 0.8

# Text processing
text_processing:
  remove_stopwords: true
  normalize_unicode: true
  streaming_for_large_files: true
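
The file is ordinary YAML and can be inspected with any loader; for example, with PyYAML (a sketch, not necessarily how the package reads it internally). Each fuzzy_min_lengths pair appears to couple a minimum pattern length with the permitted edit distance, so patterns of 10+ characters tolerate 1 edit and 15+ tolerate 2:

import yaml

with open("search_names.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["search"]["fuzzy_min_lengths"])  # [[10, 1], [15, 2]]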

🔄 Legacy CLI Support

All original commands remain available for backward compatibility:

clean_names input.csv
merge_supp cleaned.csv
preprocess augmented.csv
split_text_corpus large_corpus.csv
search_names corpus.csv
merge_results chunk_*.csv

📊 Examples

Basic Name Cleaning

from search_names.pipeline import clean_names

# Clean messy names
result = clean_names(
    infile="politicians.csv",
    outfile="clean_politicians.csv",
    col="Name",
    all=False  # Remove duplicates
)
print(f"Processed {len(result)} names")

Advanced Search with Optimizations

from search_names.pipeline import search_names
from search_names.pipeline.step4_search import load_names_file
from search_names.optimized_searchengines import benchmark_search_engines

# Load preprocessed names
names = load_names_file("politician_names.csv")

# Benchmark performance (optional)
test_text = "President Biden met with Senator Warren yesterday"
speedup = benchmark_search_engines(names, test_text, iterations=10)
print(f"Optimized engine is {speedup:.1f}x faster")

# Run optimized search
results = search_names(
    input="news_articles.csv",
    text="article_text",
    names=names,
    outfile="search_results.csv",
    max_name=50,
    processes=8,
    editlength=[8, 12],        # Fuzzy matching thresholds
    use_optimized=True,        # Use vectorized engine
    use_streaming=True,        # Stream large files
    clean=True                 # Clean text preprocessing
)

🎯 Use Cases

  • Academic Research: Find mentions of historical figures in digitized texts
  • Journalism: Track politician mentions across news coverage
  • Legal Discovery: Locate person references in legal documents
  • Genealogy: Search family names across historical records
  • Business Intelligence: Monitor executive mentions in financial reports

📈 Performance

  • Speed: Process 1M+ documents on modern hardware
  • Memory: Streaming support for unlimited dataset sizes
  • Accuracy: 95%+ precision with proper configuration
  • Scalability: Linear scaling across CPU cores

🤝 Contributing

We welcome contributions! Please see our Code of Conduct.

📄 License

This project is licensed under the MIT License.

👥 Authors

  • Suriyan Laohaprapanon - Original creator
  • Gaurav Sood - Co-creator and maintainer

For detailed documentation and advanced usage, visit our documentation site.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

search_names-0.5.0.tar.gz (41.2 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

search_names-0.5.0-py3-none-any.whl (50.4 kB)


File details

Details for the file search_names-0.5.0.tar.gz.

File metadata

  • Download URL: search_names-0.5.0.tar.gz
  • Size: 41.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for search_names-0.5.0.tar.gz
Algorithm Hash digest
SHA256 f0f88db2efa34cba8e9d4a103e101e7dde3f4af99321ed087dcc633273593c9e
MD5 11800918d6d9a82b8ebeb7c9d189370f
BLAKE2b-256 5e532729d61e7b665cf78a0e85341eca8ec97e034e2986c1ceec93eafff3a607

See more details on using hashes here.

Provenance

The following attestation bundles were made for search_names-0.5.0.tar.gz:

Publisher: python-publish.yml on appeler/search-names

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file search_names-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: search_names-0.5.0-py3-none-any.whl
  • Size: 50.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for search_names-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a9c93a91ae109387d31e8cb3c689c7fc4f8388924221f0350782a3892c235a79
MD5 90b1fb7864099bb5f8d66cd288de3e70
BLAKE2b-256 be2949a06be6fd8bbe49cc5fa9779d26ace3b824af379590b2097030226d8d7a

See more details on using hashes here.

Provenance

The following attestation bundles were made for search_names-0.5.0-py3-none-any.whl:

Publisher: python-publish.yml on appeler/search-names

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
