Oscapify
A robust tool for converting scientific literature CSV files to OSCAP-compatible format. Oscapify processes neuroscience connectivity data from PubMed/PMC sources, validates headers, retrieves DOIs, and handles errors gracefully.
Features
- Intelligent Header Validation: Automatically detects and corrects common header issues
- Flexible Header Mapping: Support for custom column names and formats
- DOI Retrieval: Fetches DOIs from NCBI API with built-in caching
- Error Recovery: Continues processing even when individual records fail
- Detailed Debugging: Comprehensive logging and header analysis tools
- Batch Processing: Process multiple files or entire directories
- Performance: Persistent caching and efficient batch operations
Installation
pip install oscapify
Development Installation
git clone https://github.com/yourusername/oscapify.git
cd oscapify
pip install -e ".[dev]"
Quick Start
Basic Usage
# Process a single file
oscapify process input.csv
# Process multiple files
oscapify process file1.csv file2.csv
# Process all CSV files in a directory
oscapify process /path/to/csv/directory/
# Specify output directory
oscapify process input.csv --output ./results
Header Validation and Debugging
# Validate CSV headers and see debugging info
oscapify validate input.csv
# Get header mapping suggestions
oscapify validate input.csv --suggest-mappings
Custom Header Mapping
If your CSV files use different column names:
oscapify process input.csv \
  --header-pmid "PubMedID" \
  --header-sentence "text" \
  --preserve-fields "custom_field1" "custom_field2"
Expected Input Format
Oscapify expects CSV files with the following columns (case-insensitive):
- pmid - PubMed ID
- sentence - Text content
- pmcid (optional) - PubMed Central ID
- pubmed_url (optional) - URL to PubMed/PMC article
Additional columns are preserved in the output.
Example Input CSV
ID,pmid,pmcid,sentence,structure_1,structure_2,relation,score,pubmed_url
1,12345678,PMC1234567,"The brain connects to the spinal cord.",brain,spinal cord,connects,0.95,https://pubmed.ncbi.nlm.nih.gov/12345678/
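The column rules above can be checked by hand before running the tool. The following is a minimal sketch using only the Python standard library; `check_headers`, `REQUIRED`, and `OPTIONAL` are hypothetical names chosen for illustration and are not part of Oscapify's API. It only mirrors the documented rule that matching is case-insensitive and that extra columns are allowed.

```python
import csv
import io

# Column names taken from the "Expected Input Format" list above.
REQUIRED = {"pmid", "sentence"}
OPTIONAL = {"pmcid", "pubmed_url"}


def check_headers(csv_text):
    """Return (missing_required, extra_columns), matching names case-insensitively.

    Hypothetical helper for illustration; not Oscapify's own validator.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    normalized = [name.strip().lower() for name in header]
    missing = sorted(REQUIRED - set(normalized))
    extra = [name for name in normalized if name not in REQUIRED | OPTIONAL]
    return missing, extra


sample = (
    "ID,PMID,pmcid,Sentence,structure_1,structure_2,relation,score,pubmed_url\n"
    '1,12345678,PMC1234567,"The brain connects to the spinal cord.",'
    "brain,spinal cord,connects,0.95,https://pubmed.ncbi.nlm.nih.gov/12345678/\n"
)

missing, extra = check_headers(sample)
print("missing required columns:", missing)
print("additional columns:", extra)
```

For an authoritative check, `oscapify validate` remains the tool to use; this sketch only illustrates the rule.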
Output Format
Oscapify outputs CSV files with OSCAP-compatible formatting:
- id - Unique identifier (format: nlp-{index}-{date})
- pmid - PubMed ID
- pmcid - PubMed Central ID
- doi - Digital Object Identifier (retrieved from NCBI)
- sentence - Original text
- batch_name - Processing batch identifier
- sentence_id - Sentence identifier
- out_of_scope - "yes" if DOI couldn't be retrieved, "no" otherwise
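To make the output columns concrete, here is a rough sketch of how one output row could be assembled from an input row. `build_record` is a hypothetical function, not Oscapify's implementation. Two details are assumptions: the `{date}` part of the id is rendered as an ISO date, and `sentence_id` reuses the row index. The README does not specify either.

```python
from datetime import date


def build_record(index, row, doi, batch_name, today=None):
    """Assemble one output row from an input row (illustrative only)."""
    today = today or date.today()
    return {
        # Assumption: {date} is an ISO date; the real format may differ.
        "id": f"nlp-{index}-{today.isoformat()}",
        "pmid": row.get("pmid", ""),
        "pmcid": row.get("pmcid", ""),
        "doi": doi or "",
        "sentence": row.get("sentence", ""),
        "batch_name": batch_name,
        # Assumption: sentence_id mirrors the row index.
        "sentence_id": index,
        # Documented rule: "yes" when no DOI could be retrieved.
        "out_of_scope": "no" if doi else "yes",
    }


row = {
    "pmid": "12345678",
    "pmcid": "PMC1234567",
    "sentence": "The brain connects to the spinal cord.",
}
record = build_record(1, row, doi=None, batch_name="my_batch", today=date(2025, 1, 2))
print(record["id"], record["out_of_scope"])
```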
Advanced Features
Cache Management
# View cache statistics
oscapify cache-stats
# Clear the DOI cache
oscapify clear-cache
Error Handling Options
# Stop on first error (strict mode)
oscapify process input.csv --strict
# Disable caching for testing
oscapify process input.csv --no-cache
# Skip header validation
oscapify process input.csv --no-validation
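The difference between the default behavior (keep going when a record fails) and `--strict` (stop on the first error) can be pictured with a small loop. This is a sketch of the general pattern, not Oscapify's code; `process_records` and `parse_pmid` are hypothetical names.

```python
def process_records(records, handler, strict=False):
    """Apply handler to each record, collecting failures unless strict is set."""
    results, errors = [], []
    for index, record in enumerate(records):
        try:
            results.append(handler(record))
        except Exception as exc:
            if strict:
                # Strict mode: surface the first failure immediately.
                raise
            # Default mode: remember the failure and continue with the next record.
            errors.append((index, str(exc)))
    return results, errors


def parse_pmid(record):
    # Raises ValueError when the PMID is not numeric.
    return int(record["pmid"])


records = [{"pmid": "12345678"}, {"pmid": "not-a-number"}, {"pmid": "87654321"}]
results, errors = process_records(records, parse_pmid)
print(results)
print(len(errors), "record(s) failed")
```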
Debug Mode
# Enable detailed debug logging
oscapify process input.csv --debug
Python API
from oscapify import OscapifyProcessor
from oscapify.models import ProcessingConfig
# Create configuration
config = ProcessingConfig(
    output_dir="./output",
    batch_name="my_batch"
)
# Process files
processor = OscapifyProcessor(config)
stats = processor.process_files(["input1.csv", "input2.csv"])
# Check results
print(f"Processed {stats.processed_files} files")
print(f"Total records: {stats.total_records}")
print(f"DOI lookups: {stats.successful_doi_lookups} successful, {stats.failed_doi_lookups} failed")
Configuration
Custom Header Mapping via API
from oscapify.models import HeaderMapping, ProcessingConfig
# Define custom mapping
header_mapping = HeaderMapping(
    pmid="PubMedID",
    sentence="abstract_text",
    pmcid="PMC_ID",
    preserve_fields=["experiment_type", "confidence_score"]
)
config = ProcessingConfig(
    header_mapping=header_mapping
)
Troubleshooting
Common Issues
- Missing Headers Error
  # Check what headers are in your file
  oscapify validate problematic.csv
  # Use suggested mappings
  oscapify validate problematic.csv --suggest-mappings
- DOI Retrieval Failures
  - Check your internet connection
  - The tool implements rate limiting (3 requests/second) for NCBI API compliance
- Encoding Errors
  - Oscapify automatically tries multiple encodings
  - If issues persist, convert your CSV to UTF-8
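If a file still fails on encoding, one way to convert it to UTF-8 is sketched below. The candidate encodings and the function names (`decode_with_fallback`, `convert_to_utf8`) are assumptions made for illustration; the README does not say which encodings Oscapify itself tries. Note that `latin-1` accepts any byte sequence, so it works as a last resort but may yield wrong characters if the file was really in another encoding.

```python
from pathlib import Path

# Assumed candidate list, tried in order; adjust for your data.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def decode_with_fallback(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding_used) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def convert_to_utf8(source, target):
    """Rewrite source as UTF-8 at target and report the encoding detected."""
    text, used = decode_with_fallback(Path(source).read_bytes())
    # newline="" keeps the original line endings untouched.
    with open(target, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
    return used


text, used = decode_with_fallback("pmid,sentence\n1,café\n".encode("cp1252"))
print(used)
```

After conversion, run `oscapify validate` on the new file to confirm the headers are read as expected.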
Getting Help
# View all commands and options
oscapify --help
# View help for specific command
oscapify process --help
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use Oscapify in your research, please cite:
@software{oscapify,
author = {Troy Sincomb},
title = {Oscapify: A tool for converting scientific literature CSV files to OSCAP format},
year = {2025},
url = {https://github.com/yourusername/oscapify}
}
Acknowledgments
- Uses the NCBI E-utilities API for DOI retrieval
- Built with Click for the command-line interface
- Uses pandas for data processing
- Uses Pydantic for data validation