Specifind: A Natural Language Processing Tool for Automating Species Occurrence (Re-)Discovery from Scientific Literature
Specifind
Specifind is a Python toolkit built to automatically extract species occurrence information from unstructured ecological literature. It identifies scientific species names, geographic entities, and the relations connecting them, unlocking occurrence data hidden in text.
The toolkit integrates OCR, layout analysis, Named Entity Recognition, Coreference Resolution, and Relation Extraction into a unified and traceable pipeline. It is powered by a newly developed, expertly annotated dataset of 1,000+ ecological abstracts spanning biogeography, botany, entomology, mycology, and zoology.
🌿 Key Features
- 📄 Science-OCR for domain-optimized OCR of scientific papers
- 🔍 NER for scientific species names & geographic entities
- 🌍 Relation Extraction connecting species to locations
- 🧠 FastCOREF for high-performance coreference resolution
- 🧩 Built on spaCy for extensibility, speed, and NLP interoperability
- 🧱 Full pipeline for text & PDF extraction
- 🧭 Traceability that links extractions back to the original text
📦 Installation
pip install specifind
🚀 Quick Start
Basic Usage
from specifind import Specifind
s = Specifind()
s.analyze("Upupa epops is an exotic bird. It is widely extended over Spain.")
# or
s.analyze_file("path/to/file.pdf")
# Output:
# {
# "species": [
# "Upupa epops"
# ],
# "geography": [
# "Spain"
# ],
# "occurrences": {
# "Upupa epops": [
# "Spain"
# ]
# },
# "evidence": {
# "Upupa epops": {
# "Spain": [
# "It is widely extended over Spain."
# ]
# }
# }
# }
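The `evidence` mapping is what makes each extraction traceable. As a minimal sketch, assuming the result dictionary has exactly the shape shown in the output above, the following flattens it into one row per species-location pair together with its supporting sentences. `occurrence_rows` is a hypothetical helper, not part of Specifind, and the `result` literal stands in for the value returned by `s.analyze(...)`.

```python
def occurrence_rows(result):
    """Yield (species, location, evidence_sentences) for each occurrence."""
    evidence = result.get("evidence", {})
    for species, locations in result.get("occurrences", {}).items():
        for location in locations:
            # Missing evidence is tolerated and reported as an empty list.
            sentences = evidence.get(species, {}).get(location, [])
            yield species, location, list(sentences)


# Stand-in for the dictionary returned by s.analyze(...), copied from the output above.
result = {
    "species": ["Upupa epops"],
    "geography": ["Spain"],
    "occurrences": {"Upupa epops": ["Spain"]},
    "evidence": {"Upupa epops": {"Spain": ["It is widely extended over Spain."]}},
}

rows = list(occurrence_rows(result))
for species, location, sentences in rows:
    print(f"{species} -> {location}: {sentences}")
```

Rows in this form are straightforward to write to CSV or load into a data frame for review against the source text.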
📘 API Reference
analyze_file(...)
Process and extract information from a PDF file.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| `path` | `str` | — | Path to the file to analyze. |
| `first_page` | `int` | `0` | First page to process (inclusive). |
| `last_page` | `int` | PDF page length | Last page to process (exclusive). |
| `coref` | `bool` | `True` | Enable coreference resolution. |
| `dpi` | `int` | `192` if a GPU is available, else `96` | Rendering DPI for PDF pages. Consider lowering this value if you run out of memory (OOM). |
| `return_doc` | `bool` | `False` | If `True`, return the spaCy `Doc` object, with the annotations available in `doc.ents` and `doc._.relations`. |
Returns
- A dictionary containing the extracted entities, relations, and supporting evidence.
- (optional) The internal `Doc` object (if `return_doc=True`).
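Because `first_page` is inclusive and `last_page` is exclusive, a long PDF can be split into consecutive page ranges, which may help when memory is limited. The sketch below is an illustration under stated assumptions: `page_batches` and `analyze_in_batches` are hypothetical helpers, the caller supplies `total_pages` (this README does not document a way to query it), and `analyze_file` is assumed to accept the documented parameter names as keyword arguments.

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) pairs covering pages 0..total_pages.

    first_page is inclusive and last_page is exclusive, matching the
    parameter descriptions in the table above.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first_page in range(0, total_pages, batch_size):
        yield first_page, min(first_page + batch_size, total_pages)


def analyze_in_batches(analyzer, path, total_pages, batch_size=10, **kwargs):
    """Call analyzer.analyze_file once per page batch and collect the results.

    `analyzer` is expected to be a Specifind instance; extra keyword
    arguments (for example dpi=96) are passed through unchanged.
    """
    return [
        analyzer.analyze_file(path, first_page=first, last_page=last, **kwargs)
        for first, last in page_batches(total_pages, batch_size)
    ]
```

A call might look like `analyze_in_batches(Specifind(), "path/to/file.pdf", total_pages=25, dpi=96)`. Each batch is analyzed independently, so a pronoun on one batch's first page presumably cannot be resolved to a species named in the previous batch.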
analyze(...)
Process and extract information directly from raw text.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | — | Raw text to analyze. |
| `coref` | `bool` | `True` | Enable coreference resolution. |
| `return_doc` | `bool` | `False` | If `True`, return the spaCy `Doc` object with the annotations. |
Returns
- A dictionary containing the extracted entities, relations, and supporting evidence.
- (optional) The internal `Doc` object (if `return_doc=True`).
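When several texts are analyzed separately (for example, one abstract per call), their result dictionaries can be combined afterwards. The following is a minimal sketch, assuming each result has the dictionary shape shown in the Quick Start output. `merge_results` is a hypothetical helper rather than part of Specifind, and the two input dictionaries are made-up stand-ins for real `analyze(...)` results.

```python
def merge_results(results):
    """Merge several result dictionaries into one, dropping duplicates."""
    merged = {"species": [], "geography": [], "occurrences": {}, "evidence": {}}

    def extend_unique(target, items):
        # Append only unseen items so the original order is preserved.
        for item in items:
            if item not in target:
                target.append(item)

    for result in results:
        extend_unique(merged["species"], result.get("species", []))
        extend_unique(merged["geography"], result.get("geography", []))
        for species, locations in result.get("occurrences", {}).items():
            extend_unique(merged["occurrences"].setdefault(species, []), locations)
        for species, by_location in result.get("evidence", {}).items():
            species_evidence = merged["evidence"].setdefault(species, {})
            for location, sentences in by_location.items():
                extend_unique(species_evidence.setdefault(location, []), sentences)
    return merged


# Illustrative inputs only; real values would come from s.analyze(...).
first = {
    "species": ["Upupa epops"],
    "geography": ["Spain"],
    "occurrences": {"Upupa epops": ["Spain"]},
    "evidence": {"Upupa epops": {"Spain": ["It is widely extended over Spain."]}},
}
second = {
    "species": ["Upupa epops"],
    "geography": ["Portugal"],
    "occurrences": {"Upupa epops": ["Portugal"]},
    "evidence": {"Upupa epops": {"Portugal": ["Upupa epops breeds in Portugal."]}},
}

merged = merge_results([first, second])
print(merged["occurrences"])
```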
🚀 Benchmarks
Named Entity Recognition (NER)
Species & Locations
| 🔍 Match Type | 🎯 Precision | 📈 Recall | 🏆 F1 |
|---|---|---|---|
| Exact | 0.904 | 0.935 | 0.919 |
| Partial/Intersect | 0.938 | 0.969 | 0.958 |
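The README does not define the two match types. A common convention, assumed here, is that an exact match requires identical span boundaries and label, while a partial (intersect) match only requires the spans to overlap and carry the same label. The sketch below scores a toy example under both conventions. The span format `(start, end, label)` with an exclusive `end`, the label names, and the `score` function are illustrative assumptions, not the authors' evaluation code.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def score(gold, predicted, partial=False):
    """Return (precision, recall, f1) under exact or partial span matching."""
    if partial:
        matched_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
        matched_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(predicted)
        matched_pred = sum(p in gold_set for p in predicted)
        matched_gold = sum(g in pred_set for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy text: "Upupa epops is found in southern Spain."
gold = [(0, 11, "SPECIES"), (24, 38, "LOCATION")]       # "southern Spain"
predicted = [(0, 11, "SPECIES"), (33, 38, "LOCATION")]  # only "Spain"

print(score(gold, predicted))                # (0.5, 0.5, 0.5)
print(score(gold, predicted, partial=True))  # (1.0, 1.0, 1.0)
```

The truncated location counts as a miss under exact matching but as a hit under partial matching, which is why partial scores are never lower than exact ones on the same predictions.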
Relation Extraction (RE)
Occurrences
| 🎯 Precision | 📈 Recall | 🏆 F1 |
|---|---|---|
| 0.964 | 0.993 | 0.978 |
🤝 Contributing
Contributions, issue reports, and feature suggestions are welcome. Feel free to open a Pull Request or discussion.
📄 License
Specifind is licensed under AGPL-3.0. See LICENSE for details.
📚 Citing Specifind
If you use Specifind in your research, please cite our pre-print:
BibTeX
@article{specifind2025,
title = {Specifind: A Natural Language Processing Tool for Automating Species Occurrence (Re-)Discovery from Scientific Literature},
author = {Golomb Durán, Tomas and Díaz, Anna and Barroso, María and Far, Antoni Josep and Roldán, Alejandro and Cancellario, Tommaso},
year = {2025},
journal = {bioRxiv},
url = {https://github.com/ToGo347/specifind}
}