Oscapify

A robust tool for converting scientific literature CSV files to OSCAP-compatible format. Oscapify processes neuroscience connectivity data from PubMed/PMC sources, validates headers, retrieves DOIs, and handles errors gracefully.

Features

  • Intelligent Header Validation: Automatically detects and corrects common header issues
  • Flexible Header Mapping: Support for custom column names and formats
  • DOI Retrieval: Fetches DOIs from NCBI API with built-in caching
  • Error Recovery: Continues processing even when individual records fail
  • Detailed Debugging: Comprehensive logging and header analysis tools
  • Batch Processing: Process multiple files or entire directories
  • Performance: Persistent caching and efficient batch operations

Installation

pip install oscapify

Development Installation

git clone https://github.com/yourusername/oscapify.git
cd oscapify
pip install -e ".[dev]"

Quick Start

Basic Usage

# Process a single file
oscapify process input.csv

# Process multiple files
oscapify process file1.csv file2.csv

# Process all CSV files in a directory
oscapify process /path/to/csv/directory/

# Specify output directory
oscapify process input.csv --output ./results
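
As a rough illustration of how a mix of file and directory arguments could be expanded into a list of CSV files, here is a minimal sketch. The helper name `collect_csv_files` is hypothetical and not part of Oscapify, and the non-recursive directory scan is an assumption; the actual tool may behave differently.

```python
from pathlib import Path
from typing import Iterable, List


def collect_csv_files(paths: Iterable[str]) -> List[Path]:
    """Expand file and directory paths into a sorted, de-duplicated list of CSV files.

    Hypothetical helper for illustration only; not Oscapify's implementation.
    """
    found = set()
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Assumption: only CSV files directly inside the directory (no recursion).
            found.update(
                p for p in path.iterdir()
                if p.is_file() and p.suffix.lower() == ".csv"
            )
        elif path.is_file() and path.suffix.lower() == ".csv":
            found.add(path)
        # Paths that do not exist or are not CSV files are silently skipped here.
    return sorted(found)
```

Returning a sorted list keeps the processing order stable between runs, which makes logs easier to compare.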

Header Validation and Debugging

# Validate CSV headers and see debugging info
oscapify validate input.csv

# Get header mapping suggestions
oscapify validate input.csv --suggest-mappings
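
To give a feel for what mapping suggestions might involve, the sketch below matches the expected column names against the headers found in a file, first exactly (ignoring case) and then by string similarity using the standard library's `difflib`. The function name `suggest_mappings`, the similarity cutoff, and the two-pass strategy are assumptions made for this example; Oscapify's own heuristics may differ.

```python
import difflib
from typing import Dict, List, Optional

# Column names listed under "Expected Input Format".
EXPECTED_COLUMNS = ["pmid", "sentence", "pmcid", "pubmed_url"]


def suggest_mappings(headers: List[str], cutoff: float = 0.8) -> Dict[str, Optional[str]]:
    """Suggest which source header could stand in for each expected column.

    Illustrative only: a two-pass match (exact, then fuzzy) over lowercased headers.
    """
    # Map normalized header -> original header; matched headers are removed
    # from the pool so one source column is never suggested twice.
    pool = {h.strip().lower(): h for h in headers}
    suggestions: Dict[str, Optional[str]] = {}

    # Pass 1: case-insensitive exact matches.
    for name in EXPECTED_COLUMNS:
        suggestions[name] = pool.pop(name, None)

    # Pass 2: fuzzy matches for whatever is still unresolved.
    for name in EXPECTED_COLUMNS:
        if suggestions[name] is None:
            close = difflib.get_close_matches(name, list(pool), n=1, cutoff=cutoff)
            if close:
                suggestions[name] = pool.pop(close[0])
    return suggestions
```

A column with no plausible match maps to `None`, which is the case where an explicit header option is needed.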

Custom Header Mapping

If your CSV files use different column names:

oscapify process input.csv \
  --header-pmid "PubMedID" \
  --header-sentence "text" \
  --preserve-fields "custom_field1" "custom_field2"
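
Conceptually, a header mapping renames the source columns to the names Oscapify expects and leaves everything else alone. The following sketch shows that idea with the standard `csv` module; `rename_columns` is a hypothetical helper, and the mapping direction (expected name to source name) is an assumption chosen to mirror the options above.

```python
import csv
import io
from typing import Dict, List


def rename_columns(rows: List[Dict[str, str]], mapping: Dict[str, str]) -> List[Dict[str, str]]:
    """Rename source columns to their expected names; other columns pass through.

    `mapping` goes from expected name to source name, e.g. {"pmid": "PubMedID"}.
    Illustrative helper, not part of Oscapify.
    """
    reverse = {source: expected for expected, source in mapping.items()}
    return [{reverse.get(key, key): value for key, value in row.items()} for row in rows]


# Example with made-up data: "PubMedID" and "text" are the custom column names.
source = io.StringIO("PubMedID,text,score\n12345678,Some sentence.,0.95\n")
rows = list(csv.DictReader(source))
renamed = rename_columns(rows, {"pmid": "PubMedID", "sentence": "text"})
```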

Expected Input Format

Oscapify expects CSV files with the following columns (case-insensitive):

  • pmid - PubMed ID
  • sentence - Text content
  • pmcid (optional) - PubMed Central ID
  • pubmed_url (optional) - URL to PubMed/PMC article

Additional columns are preserved in the output.

Example Input CSV

ID,pmid,pmcid,sentence,structure_1,structure_2,relation,score,pubmed_url
1,12345678,PMC1234567,"The brain connects to the spinal cord.",brain,spinal cord,connects,0.95,https://pubmed.ncbi.nlm.nih.gov/12345678/
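
The case-insensitive column check described above can be sketched as follows, using the example CSV as input. Treating `pmid` and `sentence` as required follows from the other two columns being marked optional; the function name `check_headers` is hypothetical and this is not Oscapify's validation code.

```python
import csv
import io
from typing import List, Tuple

REQUIRED = ("pmid", "sentence")      # assumed required: not marked optional above
OPTIONAL = ("pmcid", "pubmed_url")


def check_headers(header_row: List[str]) -> Tuple[List[str], List[str]]:
    """Return (missing required columns, additional columns), ignoring case."""
    normalized = {h.strip().lower() for h in header_row}
    missing = [name for name in REQUIRED if name not in normalized]
    known = set(REQUIRED) | set(OPTIONAL)
    additional = [h for h in header_row if h.strip().lower() not in known]
    return missing, additional


SAMPLE = (
    "ID,pmid,pmcid,sentence,structure_1,structure_2,relation,score,pubmed_url\n"
    '1,12345678,PMC1234567,"The brain connects to the spinal cord.",'
    "brain,spinal cord,connects,0.95,https://pubmed.ncbi.nlm.nih.gov/12345678/\n"
)

header = next(csv.reader(io.StringIO(SAMPLE)))
missing, additional = check_headers(header)
# missing is empty; additional holds the columns that would be carried through.
```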

Output Format

Oscapify outputs CSV files with OSCAP-compatible formatting:

  • id - Unique identifier (format: nlp-{index}-{date})
  • pmid - PubMed ID
  • pmcid - PubMed Central ID
  • doi - Digital Object Identifier (retrieved from NCBI)
  • sentence - Original text
  • batch_name - Processing batch identifier
  • sentence_id - Sentence identifier
  • out_of_scope - "yes" if DOI couldn't be retrieved, "no" otherwise
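
The field list above can be illustrated with a small function that assembles one output row. Several details here are assumptions, since they are not specified above: the date in `id` is written in ISO format, `sentence_id` simply reuses the row index, and a missing DOI is stored as an empty string. `build_record` is a hypothetical name.

```python
from datetime import date
from typing import Dict, Optional


def build_record(
    index: int,
    pmid: str,
    sentence: str,
    batch_name: str,
    pmcid: str = "",
    doi: Optional[str] = None,
    today: Optional[date] = None,
) -> Dict[str, str]:
    """Assemble one OSCAP-style output row (illustrative sketch only)."""
    today = today or date.today()
    return {
        # Assumption: {date} is rendered as YYYY-MM-DD.
        "id": f"nlp-{index}-{today.isoformat()}",
        "pmid": pmid,
        "pmcid": pmcid,
        "doi": doi or "",
        "sentence": sentence,
        "batch_name": batch_name,
        # Assumption: the sentence identifier is just the row index.
        "sentence_id": str(index),
        # "yes" when no DOI could be retrieved, "no" otherwise.
        "out_of_scope": "no" if doi else "yes",
    }
```

Passing `today` explicitly keeps the output reproducible in tests.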

Advanced Features

Cache Management

# View cache statistics
oscapify cache-stats

# Clear the DOI cache
oscapify clear-cache
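
For readers curious how a persistent DOI cache with statistics might work, here is a minimal sketch backed by a JSON file. The class name `DoiCache`, the file format, and the choice to cache only successful lookups are all assumptions for illustration; Oscapify's cache may be stored and structured differently.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Optional


class DoiCache:
    """Tiny JSON-file cache mapping PMID -> DOI (illustrative, not Oscapify's)."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.hits = 0
        self.misses = 0
        self._data: Dict[str, str] = {}
        if self.path.exists():
            self._data = json.loads(self.path.read_text(encoding="utf-8"))

    def get_or_fetch(self, pmid: str, fetch: Callable[[str], Optional[str]]) -> Optional[str]:
        """Return the cached DOI, or call `fetch` and remember a successful result."""
        if pmid in self._data:
            self.hits += 1
            return self._data[pmid]
        self.misses += 1
        doi = fetch(pmid)
        if doi is not None:
            # Only successes are stored, so failed lookups can be retried later.
            self._data[pmid] = doi
            self.path.write_text(json.dumps(self._data), encoding="utf-8")
        return doi

    def clear(self) -> None:
        """Drop all cached entries and delete the cache file if it exists."""
        self._data = {}
        if self.path.exists():
            self.path.unlink()
```

The `fetch` callable is injected so the cache can be exercised without network access.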

Error Handling Options

# Stop on first error (strict mode)
oscapify process input.csv --strict

# Disable caching for testing
oscapify process input.csv --no-cache

# Skip header validation
oscapify process input.csv --no-validation
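
The difference between the default error recovery and strict mode can be pictured as a loop that either collects failures or re-raises the first one. This is a generic sketch of that pattern; `process_records` and its return shape are hypothetical and do not describe Oscapify's internals.

```python
from typing import Any, Callable, Iterable, List, Tuple


def process_records(
    records: Iterable[Any],
    handler: Callable[[Any], Any],
    strict: bool = False,
) -> Tuple[List[Any], List[Tuple[int, str]]]:
    """Apply `handler` to each record, collecting errors unless `strict` is set."""
    results: List[Any] = []
    errors: List[Tuple[int, str]] = []
    for index, record in enumerate(records):
        try:
            results.append(handler(record))
        except Exception as exc:
            if strict:
                # Strict mode: stop at the first failing record.
                raise
            # Default mode: note the failure and keep going.
            errors.append((index, str(exc)))
    return results, errors
```

In the lenient case the caller gets both the successful results and a list of `(record index, message)` pairs to report afterwards.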

Debug Mode

# Enable detailed debug logging
oscapify process input.csv --debug

Python API

from oscapify import OscapifyProcessor
from oscapify.models import ProcessingConfig

# Create configuration
config = ProcessingConfig(
    output_dir="./output",
    batch_name="my_batch"
)

# Process files
processor = OscapifyProcessor(config)
stats = processor.process_files(["input1.csv", "input2.csv"])

# Check results
print(f"Processed {stats.processed_files} files")
print(f"Total records: {stats.total_records}")
print(f"DOI lookups: {stats.successful_doi_lookups} successful, {stats.failed_doi_lookups} failed")

Configuration

Custom Header Mapping via API

from oscapify.models import HeaderMapping, ProcessingConfig

# Define custom mapping
header_mapping = HeaderMapping(
    pmid="PubMedID",
    sentence="abstract_text",
    pmcid="PMC_ID",
    preserve_fields=["experiment_type", "confidence_score"]
)

config = ProcessingConfig(
    header_mapping=header_mapping
)

Troubleshooting

Common Issues

  1. Missing Headers Error

    # Check what headers are in your file
    oscapify validate problematic.csv
    
    # Use suggested mappings
    oscapify validate problematic.csv --suggest-mappings
    
  2. DOI Retrieval Failures

    • Check your internet connection
    • DOI lookups are rate-limited to 3 requests/second to comply with NCBI API usage limits, so large batches can take a while

  3. Encoding Errors

    • Oscapify automatically tries multiple encodings
    • If issues persist, convert your CSV to UTF-8
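
Two of the behaviours mentioned in this list, limiting requests to 3 per second and trying several encodings, can be sketched with the standard library alone. Both pieces are illustrative assumptions: the names `RateLimiter` and `decode_with_fallback` are hypothetical, and the encoding order shown here is a guess, not the list Oscapify actually uses.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


class RateLimiter:
    """Enforce a minimum interval between calls (e.g. 3 requests/second)."""

    def __init__(
        self,
        max_per_second: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least `min_interval` has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def decode_with_fallback(
    raw: bytes,
    encodings: Sequence[str] = ("utf-8-sig", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Try each encoding in turn; return (text, encoding that worked)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

Calling `limiter.wait()` before each lookup spaces requests at least a third of a second apart. In the decoder, `utf-8-sig` comes first because it handles UTF-8 with or without a byte order mark, and `latin-1` comes last because it accepts any byte sequence.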

Getting Help

# View all commands and options
oscapify --help

# View help for specific command
oscapify process --help

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use Oscapify in your research, please cite:

@software{oscapify,
  author = {Troy Sincomb},
  title = {Oscapify: A tool for converting scientific literature CSV files to OSCAP format},
  year = {2025},
  url = {https://github.com/yourusername/oscapify}
}

Acknowledgments

  • Uses the NCBI E-utilities API for DOI retrieval
  • Uses Click for the command-line interface
  • Uses Pandas for data processing
  • Uses Pydantic for data validation
