Specifind: A Natural Language Processing Tool for Automating Species Occurrence (Re-)Discovery from Scientific Literature

Specifind is a Python toolkit that automatically extracts species occurrence information from unstructured ecological literature. It identifies scientific species names, geographic entities, and the relations connecting them, unlocking occurrence data that would otherwise stay hidden in text.

The toolkit integrates OCR, layout analysis, Named Entity Recognition, Coreference Resolution, and Relation Extraction into a unified, traceable pipeline. Its models are trained on a newly developed, expert-annotated dataset of 1,000+ ecological abstracts spanning biogeography, botany, entomology, mycology, and zoology.


🌿 Key Features

  • 📄 Science-OCR for domain-optimized OCR of scientific papers
  • 🔍 NER for scientific species names & geographic entities
  • 🌍 Relation Extraction connecting species to locations
  • 🧠 FastCOREF for high-performance coreference resolution
  • 🧩 Built on spaCy for extensibility, speed, and NLP interoperability
  • 🧱 Full pipeline for text & PDF extraction
  • 🧭 Traceability that links extractions back to the original text

📦 Installation

pip install specifind

🚀 Quick Start

Basic Usage

from specifind import Specifind

s = Specifind()

s.analyze("Upupa epops is an exotic bird. It is widely extended over Spain.")

# or, to analyze a PDF file instead:

s.analyze_file("path/to/file.pdf")

# Output of the analyze() call on the sample text:
# {
#     "species": [
#         "Upupa epops"
#     ],
#     "geography": [
#         "Spain"
#     ],
#     "occurrences": {
#         "Upupa epops": [
#             "Spain"
#         ]
#     },
#     "evidence": {
#         "Upupa epops": {
#             "Spain": [
#                 "It is widely extended over Spain."
#             ]
#         }
#     }
# }
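
The nested output above can be flattened into one row per occurrence, which is often easier to export or review. The following is a minimal sketch that assumes the dictionary layout shown in the sample output; the helper name `iter_occurrence_rows` is hypothetical and not part of Specifind.

```python
from typing import Iterator

# Sample output copied from the Quick Start example.
result = {
    "species": ["Upupa epops"],
    "geography": ["Spain"],
    "occurrences": {"Upupa epops": ["Spain"]},
    "evidence": {"Upupa epops": {"Spain": ["It is widely extended over Spain."]}},
}


def iter_occurrence_rows(result: dict) -> Iterator[tuple[str, str, str]]:
    """Yield (species, location, evidence sentence) triples.

    Hypothetical helper: it only relies on the "occurrences" and
    "evidence" keys seen in the sample output.
    """
    evidence = result.get("evidence", {})
    for species, locations in result.get("occurrences", {}).items():
        for location in locations:
            sentences = evidence.get(species, {}).get(location, [])
            # One row per evidence sentence; keep the pair even if no
            # sentence was recorded for it.
            for sentence in sentences or [""]:
                yield species, location, sentence


rows = list(iter_occurrence_rows(result))
for row in rows:
    print(row)
```

Each row keeps the evidence sentence next to the species and location, so the traceability link survives the flattening.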

📘 API Reference

class Specifind(...)

Initializes the OCR engine.

Parameter | Type | Default | Description
use_gpu | bool | True | If True, uses the GPU if available. If False, forces CPU usage, which may be slower but is more stable on some systems and avoids memory issues.
debug | bool | False | If True, opens the displaCy visualizer with the results.

analyze_file(...)

Process and extract information from a PDF file.

Parameters

Name | Type | Default | Description
path | str | - | Path to the file to analyze.
first_page | int | 0 | First page to process (inclusive).
last_page | int | PDF page length | Last page to process (exclusive).
coref | bool | True | Enable coreference resolution.
dpi | int | 192 if GPU available else 96 | Rendering DPI for PDF pages. Consider lowering the value if you run out of memory (OOM).
return_doc | bool | False | If True, return the spaCy Doc object, with the annotations available in doc.ents and doc._.relations.
store_ocr | bool | True | If True, saves the OCR results to a txt file.
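
The advice on `dpi` can be turned into a simple retry loop that lowers the rendering resolution after an out-of-memory failure. This is only a sketch under stated assumptions: the helper `analyze_with_dpi_fallback`, the DPI ladder, and the exception types treated as OOM are all hypothetical, since the exception Specifind actually raises is not documented here. Only `analyze_file(path, dpi=...)` comes from the table above.

```python
from typing import Any, Sequence


def analyze_with_dpi_fallback(
    analyzer: Any,
    path: str,
    dpis: Sequence[int] = (192, 144, 96),
    oom_errors: tuple = (MemoryError,),
) -> dict:
    """Try analyze_file at decreasing DPI values until one succeeds.

    `analyzer` is expected to expose analyze_file(path, dpi=...).
    `oom_errors` is an assumption: pass whatever exception types your
    setup raises when it runs out of memory.
    """
    last_error = None
    for dpi in dpis:
        try:
            return analyzer.analyze_file(path, dpi=dpi)
        except oom_errors as err:
            # Remember the failure and retry at the next, lower DPI.
            last_error = err
    raise RuntimeError(f"analysis failed at every DPI in {list(dpis)}") from last_error
```

Lower DPI values reduce memory use but may also reduce OCR quality, so the ladder should stop at the lowest resolution that still reads well for your documents.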

Returns

  • A dictionary containing the parsed entities, relations, and evidence.
  • (optional) The spaCy Doc object (if return_doc=True).
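
Because `first_page` is inclusive and `last_page` is exclusive, a long PDF can be processed in consecutive page windows and the per-window results merged afterwards. The sketch below assumes the caller already knows the page count and that each call returns a dictionary with an "occurrences" key as in the Quick Start output; `page_ranges`, `merge_occurrences`, and `analyze_in_chunks` are hypothetical helpers, not part of Specifind.

```python
from typing import Any, Iterable, Iterator


def page_ranges(total_pages: int, chunk_size: int) -> Iterator[tuple[int, int]]:
    """Yield (first_page, last_page) windows; last_page is exclusive."""
    for first in range(0, total_pages, chunk_size):
        yield first, min(first + chunk_size, total_pages)


def merge_occurrences(results: Iterable[dict]) -> dict[str, list[str]]:
    """Combine the "occurrences" maps of several results without duplicates."""
    merged: dict[str, list[str]] = {}
    for result in results:
        for species, locations in result.get("occurrences", {}).items():
            bucket = merged.setdefault(species, [])
            for location in locations:
                if location not in bucket:
                    bucket.append(location)
    return merged


def analyze_in_chunks(analyzer: Any, path: str, total_pages: int, chunk_size: int = 10) -> dict:
    """Run analyze_file window by window and merge the occurrences."""
    results = [
        analyzer.analyze_file(path, first_page=first, last_page=last)
        for first, last in page_ranges(total_pages, chunk_size)
    ]
    return merge_occurrences(results)
```

One trade-off to keep in mind: a mention that refers back to a species named in an earlier window may not be resolved, since each window is analyzed on its own.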

analyze(...)

Process and extract information directly from raw text.

Parameters

Name | Type | Default | Description
text | str | - | Raw text to analyze.
coref | bool | True | Enable coreference resolution.
return_doc | bool | False | If True, return the spaCy Doc object with the annotations.

Returns

  • A dictionary containing the parsed entities, relations, and evidence.
  • (optional) The spaCy Doc object (if return_doc=True).
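
When the Doc object is requested, its entities can be listed together with their character offsets, which is one way to trace an extraction back to the original text. The sketch below uses only standard spaCy span attributes (`text`, `label_`, `start_char`, `end_char`); the helper name `entity_table` is hypothetical, and the label names Specifind assigns are not assumed.

```python
from typing import Any


def entity_table(doc: Any) -> list[tuple[str, str, int, int]]:
    """Return (text, label, start_char, end_char) for every entity in doc.ents."""
    return [
        (ent.text, ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
    ]
```

Assuming that `return_doc=True` yields the dictionary followed by the Doc (the exact return shape is not spelled out above), a call might look like `result, doc = s.analyze(text, return_doc=True)` followed by `entity_table(doc)`.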

🚀 Benchmarks

Named Entity Recognition (NER)

Species & Locations

🔍 Match Type | 🎯 Precision | 📈 Recall | 🏆 F1
Exact | 0.904 | 0.935 | 0.919
Partial/Intersect | 0.938 | 0.969 | 0.958
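
The two match types in the table can be illustrated with character spans. This is a sketch of one common convention, not necessarily the exact criterion used for the benchmark: an exact match requires identical boundaries and label, while a partial (intersecting) match only requires the same label and overlapping spans. The `Span` layout and both function names are assumptions made for illustration.

```python
# (start, end, label) with an exclusive end offset; hypothetical layout.
Span = tuple[int, int, str]


def exact_match(pred: Span, gold: Span) -> bool:
    """True when boundaries and label are identical."""
    return pred == gold


def partial_match(pred: Span, gold: Span) -> bool:
    """True when labels agree and the two spans share at least one character."""
    same_label = pred[2] == gold[2]
    overlaps = pred[0] < gold[1] and gold[0] < pred[1]
    return same_label and overlaps


gold = (0, 11, "SPECIES")   # "Upupa epops"
pred = (0, 5, "SPECIES")    # only "Upupa" was predicted
print(exact_match(pred, gold), partial_match(pred, gold))
```

Under this convention every exact match is also a partial match, which is consistent with the partial scores in the table being at least as high as the exact ones.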

Relation Extraction (RE)

Occurrences

🎯 Precision | 📈 Recall | 🏆 F1
0.964 | 0.993 | 0.978
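
For readers who want to relate the columns, F1 is conventionally the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall reported for the exact-match NER row and for relation extraction; `f1_score` is a hypothetical helper, and reported scores that are averaged per class or per document need not match this formula exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.904, 0.935), 3))  # NER, exact match -> 0.919
print(round(f1_score(0.964, 0.993), 3))  # relation extraction -> 0.978
```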

🤝 Contributing

Contributions, issue reports, and feature suggestions are welcome. Feel free to open a Pull Request or discussion.


📄 License

Specifind is licensed under AGPL-3.0. See LICENSE for details.


📚 Citing Specifind

If you use Specifind in your research, please cite our pre-print:

BibTeX

@article{specifind2025,
  title   = {Specifind: A Natural Language Processing Tool for Automating Species Occurrence (Re-)Discovery from Scientific Literature},
  author  = {Golomb Durán, Tomas and Díaz, Anna and Barroso, María and Far, Antoni Josep and Roldán, Alejandro and Cancellario, Tommaso},
  year    = {2025},
  journal = {bioRxiv},
  url     = {https://github.com/ToGo347/specifind}
}
