Oscapify

A robust tool for converting scientific literature CSV files to OSCAP-compatible format. Oscapify processes neuroscience connectivity data from PubMed/PMC sources, validates headers, retrieves DOIs, and handles errors gracefully.

Features

  • Intelligent Header Validation: Automatically detects and corrects common header issues
  • Flexible Header Mapping: Support for custom column names and formats
  • DOI Retrieval: Fetches DOIs from NCBI API with built-in caching
  • Error Recovery: Continues processing even when individual records fail
  • Detailed Debugging: Comprehensive logging and header analysis tools
  • Batch Processing: Process multiple files or entire directories
  • Performance: Persistent caching and efficient batch operations

Installation

pip install oscapify

Development Installation

git clone https://github.com/yourusername/oscapify.git
cd oscapify
pip install -e ".[dev]"

Quick Start

Basic Usage

# Process a single file
oscapify process input.csv

# Process multiple files
oscapify process file1.csv file2.csv

# Process all CSV files in a directory
oscapify process /path/to/csv/directory/

# Specify output directory
oscapify process input.csv --output ./results

Header Validation and Debugging

# Validate CSV headers and see debugging info
oscapify validate input.csv

# Get header mapping suggestions
oscapify validate input.csv --suggest-mappings

Custom Header Mapping

If your CSV files use different column names:

oscapify process input.csv \
  --header-pmid "PubMedID" \
  --header-sentence "text" \
  --preserve-fields "custom_field1" "custom_field2"
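Conceptually, a header mapping renames source columns to the names Oscapify expects. The sketch below is only an illustration of that idea, not Oscapify's implementation: `apply_header_mapping` is a hypothetical helper, and it assumes that only mapped and explicitly preserved columns are carried over.

```python
def apply_header_mapping(row: dict, mapping: dict, preserve: list) -> dict:
    """Rename columns in one row.

    mapping: expected name -> column name used in the source file.
    preserve: source columns copied through unchanged (assumed behavior).
    """
    out = {expected: row[source] for expected, source in mapping.items()}
    for name in preserve:
        out[name] = row[name]
    return out


row = {
    "PubMedID": "12345678",
    "text": "Example sentence.",
    "custom_field1": "a",
    "unused": "x",
}
mapped = apply_header_mapping(
    row,
    mapping={"pmid": "PubMedID", "sentence": "text"},
    preserve=["custom_field1"],
)
print(mapped)
```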

Expected Input Format

Oscapify expects CSV files with the following columns (case-insensitive):

  • pmid - PubMed ID
  • sentence - Text content
  • pmcid (optional) - PubMed Central ID
  • pubmed_url (optional) - URL to PubMed/PMC article

Additional columns are preserved in the output.
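To check a file against the column list above before running Oscapify, a standalone script along these lines may help. It is a minimal sketch using only the standard library; `check_headers` is a hypothetical helper and not part of the Oscapify API, and ignoring surrounding whitespace is an assumption.

```python
import csv
import io

REQUIRED = {"pmid", "sentence"}      # required columns listed above
OPTIONAL = {"pmcid", "pubmed_url"}   # optional columns listed above


def check_headers(csv_text: str) -> dict:
    """Classify a CSV header row into missing, optional, and extra columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    # Compare case-insensitively; stripping whitespace is an assumption.
    normalized = {name.strip().lower() for name in header}
    return {
        "missing": sorted(REQUIRED - normalized),
        "optional_present": sorted(OPTIONAL & normalized),
        "extra": sorted(normalized - REQUIRED - OPTIONAL),
    }


sample = "ID,PMID,pmcid,Sentence,score\n1,12345678,PMC1234567,Example text.,0.95\n"
report = check_headers(sample)
print(report)
```

An empty `missing` list means both required columns were found.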

Example Input CSV

ID,pmid,pmcid,sentence,structure_1,structure_2,relation,score,pubmed_url
1,12345678,PMC1234567,"The brain connects to the spinal cord.",brain,spinal cord,connects,0.95,https://pubmed.ncbi.nlm.nih.gov/12345678/

Output Format

Oscapify outputs CSV files with OSCAP-compatible formatting:

  • id - Unique identifier (format: nlp-{index}-{date})
  • pmid - PubMed ID
  • pmcid - PubMed Central ID
  • doi - Digital Object Identifier (retrieved from NCBI)
  • sentence - Original text
  • batch_name - Processing batch identifier
  • sentence_id - Sentence identifier
  • out_of_scope - "yes" if DOI couldn't be retrieved, "no" otherwise
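The `id` and `out_of_scope` rules above can be illustrated with a short sketch. `make_output_row` is a hypothetical helper, the ISO date format in the `id` is an assumption (the exact date format is not specified here), and the DOI is a made-up example value.

```python
from datetime import date
from typing import Optional


def make_output_row(index: int, pmid: str, sentence: str,
                    doi: Optional[str], batch_name: str,
                    today: Optional[date] = None) -> dict:
    """Build a partial output row following the field descriptions above."""
    today = today or date.today()
    return {
        # Format nlp-{index}-{date}; YYYY-MM-DD is assumed for {date}.
        "id": f"nlp-{index}-{today.isoformat()}",
        "pmid": pmid,
        "doi": doi or "",
        "sentence": sentence,
        "batch_name": batch_name,
        # "yes" when no DOI could be retrieved, "no" otherwise.
        "out_of_scope": "no" if doi else "yes",
    }


found = make_output_row(1, "12345678", "Example sentence.",
                        "10.1000/example", "my_batch", date(2025, 1, 2))
not_found = make_output_row(2, "23456789", "Another sentence.",
                            None, "my_batch", date(2025, 1, 2))
print(found["id"], found["out_of_scope"])
print(not_found["id"], not_found["out_of_scope"])
```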

Advanced Features

Cache Management

# View cache statistics
oscapify cache-stats

# Clear the DOI cache
oscapify clear-cache
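For intuition about what a persistent DOI cache does, here is a minimal sketch. It assumes a JSON file keyed by PMID; `DoiCache`, its file layout, and the `fetch` callback are hypothetical and may differ from how Oscapify stores its cache.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


class DoiCache:
    """Hypothetical JSON-file cache mapping PMID -> DOI ("" if none found)."""

    def __init__(self, path: Path):
        self.path = path
        self.data = json.loads(path.read_text()) if path.exists() else {}
        self.hits = 0
        self.misses = 0

    def get(self, pmid: str, fetch: Callable[[str], Optional[str]]) -> Optional[str]:
        """Return the cached DOI, calling fetch(pmid) only on a cache miss."""
        if pmid in self.data:
            self.hits += 1
        else:
            self.misses += 1
            self.data[pmid] = fetch(pmid) or ""
            self.path.write_text(json.dumps(self.data))  # persist after each miss
        return self.data[pmid] or None

    def clear(self) -> None:
        """Drop all entries and delete the cache file if it exists."""
        self.data = {}
        self.path.unlink(missing_ok=True)


calls = []


def fake_fetch(pmid: str) -> str:
    """Stand-in for a network lookup; returns a made-up DOI."""
    calls.append(pmid)
    return "10.1000/example"


with tempfile.TemporaryDirectory() as tmp:
    cache = DoiCache(Path(tmp) / "doi_cache.json")
    cache.get("12345678", fake_fetch)
    cache.get("12345678", fake_fetch)  # served from the cache, no second fetch
    print(cache.hits, cache.misses, len(calls))
```

The second lookup never reaches `fake_fetch`, which is the behavior that `--no-cache` would presumably switch off.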

Error Handling Options

# Stop on first error (strict mode)
oscapify process input.csv --strict

# Disable caching for testing
oscapify process input.csv --no-cache

# Skip header validation
oscapify process input.csv --no-validation
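The difference between the default behavior and `--strict` can be pictured as follows. This is a sketch of the general pattern only; `process_records` and `require_pmid` are hypothetical names, not Oscapify functions.

```python
from typing import Callable, Iterable, List, Tuple


def process_records(records: Iterable[dict],
                    handle: Callable[[dict], dict],
                    strict: bool = False) -> Tuple[List[dict], List[Tuple[int, str]]]:
    """Apply handle to each record, collecting failures unless strict is set."""
    results, errors = [], []
    for index, record in enumerate(records):
        try:
            results.append(handle(record))
        except Exception as exc:
            if strict:
                raise  # strict mode: stop on the first error
            errors.append((index, str(exc)))  # default: note the failure, keep going
    return results, errors


def require_pmid(record: dict) -> dict:
    """Example handler that rejects records without a PMID."""
    if not record.get("pmid"):
        raise ValueError("missing pmid")
    return record


rows = [{"pmid": "12345678"}, {"pmid": ""}, {"pmid": "23456789"}]
ok, failed = process_records(rows, require_pmid)
print(len(ok), failed)
```

With `strict=True`, the same input raises `ValueError` at the second row instead of returning.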

Debug Mode

# Enable detailed debug logging
oscapify process input.csv --debug

Python API

from oscapify import OscapifyProcessor
from oscapify.models import ProcessingConfig

# Create configuration
config = ProcessingConfig(
    output_dir="./output",
    batch_name="my_batch"
)

# Process files
processor = OscapifyProcessor(config)
stats = processor.process_files(["input1.csv", "input2.csv"])

# Check results
print(f"Processed {stats.processed_files} files")
print(f"Total records: {stats.total_records}")
print(f"DOI lookups: {stats.successful_doi_lookups} successful, {stats.failed_doi_lookups} failed")

Configuration

Custom Header Mapping via API

from oscapify.models import HeaderMapping, ProcessingConfig

# Define custom mapping
header_mapping = HeaderMapping(
    pmid="PubMedID",
    sentence="abstract_text",
    pmcid="PMC_ID",
    preserve_fields=["experiment_type", "confidence_score"]
)

config = ProcessingConfig(
    header_mapping=header_mapping
)

Troubleshooting

Common Issues

  1. Missing Headers Error

    # Check what headers are in your file
    oscapify validate problematic.csv

    # Use suggested mappings
    oscapify validate problematic.csv --suggest-mappings

  2. DOI Retrieval Failures

    • Check your internet connection
    • The tool implements rate limiting (3 requests/second) for NCBI API compliance

  3. Encoding Errors

    • Oscapify automatically tries multiple encodings
    • If issues persist, convert your CSV to UTF-8
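Trying multiple encodings, as mentioned for encoding errors above, usually means attempting each candidate in turn until one decodes cleanly. The sketch below shows that pattern; the candidate list and the `decode_with_fallback` name are assumptions, not the encodings Oscapify actually tries.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes,
    encodings: Sequence[str] = ("utf-8-sig", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of {list(encodings)} could decode the input")


utf8_text, utf8_used = decode_with_fallback("café".encode("utf-8"))
legacy_text, legacy_used = decode_with_fallback("café".encode("cp1252"))
print(utf8_used, legacy_used)
```

Because `latin-1` accepts any byte sequence, placing it last guarantees a result, though the decoded text may be wrong for files in other encodings. Converting the CSV to UTF-8 remains the more reliable fix.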

Getting Help

# View all commands and options
oscapify --help

# View help for specific command
oscapify process --help

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use Oscapify in your research, please cite:

@software{oscapify,
  author = {Troy Sincomb},
  title = {Oscapify: A tool for converting scientific literature CSV files to OSCAP format},
  year = {2025},
  url = {https://github.com/yourusername/oscapify}
}

Acknowledgments

  • Uses the NCBI E-utilities API for DOI retrieval
  • Uses Click for the command-line interface
  • Uses pandas for data processing
  • Uses Pydantic for data validation
