Drug discovery NER wrapper around LangExtract — zero-config entity extraction for chemistry and biology.
structflo.ner
Zero-config Named Entity Recognition for drug discovery, chemistry, and biological sciences.
Installation • LLM Extraction • Fast NER • Profiles • Visualization • Notebooks
structflo.ner is a lightweight NER library specialized for pharmaceutical and biological sciences. It combines LangExtract with dictionary and fuzzy-matching tools to deliver zero-configuration entity extraction.
It ships with two extraction engines:
| | NERExtractor | FastNERExtractor |
|---|---|---|
| Approach | LLM-powered (Gemini, Ollama) | Dictionary-based (YAML gazetteers) |
| Speed | ~10-60s per abstract | ~0.4-1s per abstract |
| Novel entities | Discovers new entities | Known terms only |
| Context awareness | Full contextual understanding | String matching (exact + fuzzy) |
| Cost | API costs or local GPU | Free (no API calls) |
| Setup | API key or Ollama | Zero config |
| Output format | NERResult | NERResult (identical) |
Installation
pip install structflo-ner
# or with uv
uv add structflo-ner
Install optional extras as needed:
pip install "structflo-ner[dataframe]" # pandas DataFrame support
pip install "structflo-ner[fast]" # fast dictionary-based NER (rapidfuzz)
LLM-Powered Extraction
Cloud model (Gemini)
The default model is gemini-2.5-flash. Pass your API key or set the GEMINI_API_KEY environment variable.
from structflo.ner import NERExtractor
extractor = NERExtractor(api_key="YOUR_GEMINI_KEY")
result = extractor.extract(
    "Gefitinib (ZD1839) is a first-generation EGFR tyrosine kinase inhibitor "
    "with IC50 = 0.033 µM, approved for non-small cell lung cancer (NSCLC). "
    "Its SMILES is COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1."
)
Local models via Ollama
Run extraction entirely on your own hardware — no API key needed:
from structflo.ner import NERExtractor
extractor = NERExtractor(
    model_id="qwen2.5:72b",
    model_url="http://localhost:11434",
)
# Note the trailing space so the two sentences do not run together.
text = (
    "Gefitinib (ZD1839) is a first-generation EGFR inhibitor with IC50 = 0.033 µM approved for NSCLC. "
    "Its SMILES is COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1."
)
result = extractor.extract(text)
result
Any model served by Ollama works: gemma, llama, mistral, qwen, deepseek, etc.
In Jupyter notebooks, results render as color-coded, interactive HTML.
To get a pandas DataFrame (requires the dataframe extra):
result.to_dataframe()
For a TB-specific extractor, pass profile=TB:
from structflo.ner import NERExtractor, TB
extractor = NERExtractor(
    model_id="qwen2.5:72b",
    model_url="http://localhost:11434",
    profile=TB,
)
text = (
    "Bedaquiline (TMC207) is a diarylquinoline that inhibits the "
    "mycobacterial ATP synthase subunit c encoded by atpE (Rv1305). "
    "It shows potent activity against Mycobacterium tuberculosis "
    "including MDR-TB and XDR-TB. This compound was identified through "
    "whole-cell screening and targets the energy metabolism pathway."
)
result = extractor.extract(text)
result
# Flat list of all entities
for entity in result.all_entities():
    print(f"{entity.entity_type:20s} | {entity.text}")
compound_name | Bedaquiline
compound_name | TMC207
target | ATP synthase subunit c
disease | MDR-TB
disease | XDR-TB
accession_number | Rv1305
functional_category | energy metabolism pathway
screening_method | whole-cell screening
Batch extraction
Pass a list of texts to extract from multiple documents at once:
texts = [
    "Imatinib inhibits BCR-ABL with IC50 = 0.6 µM in CML.",
    "Trastuzumab targets HER2 in breast cancer patients.",
    "Remdesivir (GS-5734) is an antiviral with EC50 = 0.77 µM against SARS-CoV-2.",
]
results = extractor.extract(texts)
--- Text 1 ---
compound_name | Imatinib
target | BCR-ABL
disease | CML
bioactivity | IC50 = 0.6 µM
--- Text 2 ---
compound_name | Trastuzumab
target | HER2
disease | breast cancer
--- Text 3 ---
compound_name | Remdesivir
compound_name | GS-5734
disease | SARS-CoV-2
bioactivity | EC50 = 0.77 µM
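A listing like the one above can be produced with a short loop. The sketch below assumes that extract() called with a list returns one result per input text, in input order, and that each result exposes all_entities() as in the earlier example. The _Entity and _Result classes and the format_batch helper are hypothetical stand-ins, not part of structflo.ner, so the snippet runs without a model.

```python
from dataclasses import dataclass, field


@dataclass
class _Entity:
    """Hypothetical stand-in for an extracted entity."""
    entity_type: str
    text: str


@dataclass
class _Result:
    """Hypothetical stand-in for NERResult, exposing only all_entities()."""
    entities: list = field(default_factory=list)

    def all_entities(self):
        return list(self.entities)


def format_batch(results):
    """Return a header line per text, followed by one line per entity."""
    lines = []
    for i, res in enumerate(results, start=1):
        lines.append(f"--- Text {i} ---")
        for entity in res.all_entities():
            lines.append(f"{entity.entity_type:20s} | {entity.text}")
    return lines


# With the real library, `results` would come from extractor.extract(texts).
results = [
    _Result([_Entity("compound_name", "Imatinib"), _Entity("target", "BCR-ABL")]),
    _Result([_Entity("compound_name", "Trastuzumab")]),
]
print("\n".join(format_batch(results)))
```

Returning the lines rather than printing inside the helper keeps the formatting easy to test or redirect to a file.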
Fast Dictionary-Based NER (Mode 2)
FastNERExtractor uses curated YAML gazetteers with a three-phase matching strategy for deterministic, high-speed extraction when LLMs are not available.
It runs extremely fast, but because it matches text against predefined dictionaries and patterns (exactly or fuzzily), it does not understand context.
from structflo.ner.fast import FastNERExtractor
fast = FastNERExtractor()
text = (
    "Bedaquiline (TMC207) is a diarylquinoline that inhibits the "
    "mycobacterial ATP synthase subunit c encoded by atpE (Rv1305). "
    "It shows potent activity against Mycobacterium tuberculosis "
    "including MDR-TB and XDR-TB. This compound was identified through "
    "whole-cell screening and targets the energy metabolism pathway."
)
result = fast.extract(text)
result
How matching works
| Phase | Method | What it catches |
|---|---|---|
| 1 | Exact match | Case-sensitive and normalized dictionary lookups with word-boundary enforcement |
| 1b | Regex patterns | Auto-derived patterns from accession number seeds (Rv tags, UniProt, PDB, etc.) |
| 2 | Fuzzy match | Typos and minor variants via rapidfuzz (configurable threshold) |
# Fuzzy matching catches typos
result = fast.extract("Bedaquilne showed activity against TB")
# "Bedaquilne" -> canonical: "Bedaquiline" (method: fuzzy)
# Disable fuzzy matching for strict mode
strict = FastNERExtractor(fuzzy_threshold=0)
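To make the phase ordering in the table concrete, here is a minimal sketch of an exact, then regex, then fuzzy pass. It is not the library's implementation: it uses the standard library's difflib in place of rapidfuzz, a toy in-memory gazetteer instead of the YAML files, case-sensitive exact lookups only, and an illustrative Rv locus-tag regex. The threshold scale (0 to 1) and the minimum token length for fuzzy matching are also assumptions.

```python
import re
from difflib import SequenceMatcher

# Toy gazetteer (assumption); the real library loads its terms from YAML files.
GAZETTEER = {
    "compound_name": ["Bedaquiline", "Delamanid"],
    "disease": ["MDR-TB", "TB"],
}
# Illustrative pattern for Rv locus tags (assumption, not the library's regex).
ACCESSION_RE = re.compile(r"\bRv\d{4}[Ac]?\b")


def match_entities(text, threshold=0.85):
    """Return (entity_type, matched_text, canonical, method) tuples."""
    hits, claimed = [], []

    def is_free(start, end):
        # True if the span overlaps no span matched by an earlier phase.
        return all(end <= s or start >= e for s, e in claimed)

    # Longest terms first, so "MDR-TB" wins over "TB".
    terms = sorted(
        ((term, etype) for etype, items in GAZETTEER.items() for term in items),
        key=lambda pair: -len(pair[0]),
    )

    # Phase 1: exact lookup with word-boundary enforcement.
    for term, etype in terms:
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        for m in re.finditer(pattern, text):
            if is_free(*m.span()):
                claimed.append(m.span())
                hits.append((etype, m.group(), term, "exact"))

    # Phase 1b: regex patterns for accession-like identifiers.
    for m in ACCESSION_RE.finditer(text):
        if is_free(*m.span()):
            claimed.append(m.span())
            hits.append(("accession_number", m.group(), m.group(), "regex"))

    # Phase 2: fuzzy match on the remaining tokens (short tokens are skipped).
    for m in re.finditer(r"[\w-]+", text):
        token = m.group()
        if len(token) < 5 or not is_free(*m.span()):
            continue
        for term, etype in terms:
            score = SequenceMatcher(None, token.lower(), term.lower()).ratio()
            if score >= threshold:
                claimed.append(m.span())
                hits.append((etype, token, term, "fuzzy"))
                break
    return hits


for hit in match_entities("Bedaquilne (Rv1305) showed activity against MDR-TB"):
    print(hit)
```

Running the cheaper, more precise phases first and letting them claim character spans keeps the fuzzy pass from re-labelling text that already has an exact or pattern match.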
Built-in gazetteers
The fast extractor ships with curated gazetteers for TB drug discovery:
| Gazetteer | Examples |
|---|---|
| accession_number | Rv1305, B586_RS00005 |
| gene_name | atpE, InhA, DprE1 |
| screening_method | whole-cell screening, fragment-based screening |
| target | InhA, DprE1, MmpL3 |
| compound_name | Bedaquiline, Delamanid, Pretomanid |
| functional_category | DNA replication, cell wall biosynthesis |
| strain | M. tuberculosis H37Rv |
| product | enoyl-ACP reductase, ATP synthase subunit c |
| disease | TB, MDR-TB, XDR-TB |
Custom gazetteers
Extend the built-in dictionaries with your own terms:
custom = FastNERExtractor(
    extra_gazetteers={
        "target": ["MyNovelTarget", "KinaseX"],
        "compound_name": ["CompoundABC"],
    }
)
Or drop a new YAML file into the gazetteers directory — the filename (without .yml) maps to an entity type.
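The filename-to-entity-type convention can be illustrated with a small, library-independent sketch. The discover_gazetteers helper and the file names below are hypothetical; the snippet does not parse YAML or assume where structflo.ner keeps its gazetteers. It only shows how a file stem such as cell_line becomes an entity type.

```python
import tempfile
from pathlib import Path


def discover_gazetteers(directory):
    """Map entity type (file stem) to path for every .yml file in a directory."""
    return {path.stem: path for path in sorted(Path(directory).glob("*.yml"))}


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; the contents are irrelevant for this sketch.
    for name in ("compound_name.yml", "cell_line.yml", "notes.txt"):
        (Path(tmp) / name).write_text("")
    entity_types = list(discover_gazetteers(tmp))

print(entity_types)  # ['cell_line', 'compound_name']
```

Files with other extensions, such as notes.txt here, are ignored by the glob.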
Performance
Single abstract: ~393 ms
8 abstracts: ~862 ms
Profiles
Profiles control which entity types are extracted. Use them to focus the model on specific categories.
Built-in profiles
| Profile | Entity classes |
|---|---|
| FULL (default) | compounds, targets, diseases, bioactivities, assays, mechanisms |
| CHEMISTRY | compound names, SMILES, CAS numbers, molecular formulas |
| BIOLOGY | targets, gene names, protein names |
| BIOACTIVITY | bioactivity measurements, assays |
| DISEASE | diseases and clinical indications |
| TB | TB drug discovery (compounds, targets, diseases, accessions, strains, screening methods, functional categories) |
from structflo.ner import NERExtractor, CHEMISTRY
extractor = NERExtractor(api_key="YOUR_GEMINI_KEY")
result = extractor.extract(text, profile=CHEMISTRY)
Merging profiles
Combine multiple profiles for broader extraction:
from structflo.ner import CHEMISTRY, BIOLOGY
combined = CHEMISTRY.merge(BIOLOGY)
result = extractor.extract(text, profile=combined)
# Profile: chemistry+biology
# Entity classes: compound_name, smiles, cas_number, molecular_formula, target, gene_name, protein_name
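The merged name and entity classes shown in the comments above are consistent with a merge that joins the names with "+" and takes an ordered union of the entity classes. The sketch below models that behavior with a hypothetical ProfileSketch class; it is an assumption about the semantics, not the library's EntityProfile implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileSketch:
    """Hypothetical stand-in for a profile: a name plus ordered entity classes."""
    name: str
    entity_classes: tuple

    def merge(self, other):
        # Keep first-seen order and drop classes that appear in both profiles.
        merged = tuple(dict.fromkeys(self.entity_classes + other.entity_classes))
        return ProfileSketch(f"{self.name}+{other.name}", merged)


chemistry = ProfileSketch(
    "chemistry", ("compound_name", "smiles", "cas_number", "molecular_formula")
)
biology = ProfileSketch("biology", ("target", "gene_name", "protein_name"))

combined = chemistry.merge(biology)
print(combined.name)
print(", ".join(combined.entity_classes))
```

Using dict.fromkeys for the union preserves order, which keeps the merged class list stable when the same profiles are combined repeatedly.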
Custom profiles
Define your own extraction schema:
from structflo.ner import NERExtractor, EntityProfile
my_profile = EntityProfile(
    name="kinase_inhibitors",
    entity_classes=["compound_name", "smiles", "target", "bioactivity"],
    prompt="Extract kinase inhibitor names, SMILES, targets, and potency values.",
    examples=my_examples,  # my_examples must be defined beforehand; it is not shown here
)
result = extractor.extract(text, profile=my_profile)
Working with Results
Both extractors return identical NERResult objects:
# Typed entity lists
result.compounds # [ChemicalEntity(...)]
result.targets # [TargetEntity(...)]
result.diseases # [DiseaseEntity(...)]
result.bioactivities # [BioactivityEntity(...)]
result.assays # [...]
result.mechanisms # [...]
result.accessions # [AccessionEntity(...)]
# Flat list of all entities
result.all_entities()
# Export to pandas DataFrame
df = result.to_dataframe()
# Serialize to dict (JSON-friendly)
result.to_dict()
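When the typed lists are not granular enough, the flat list can be regrouped by entity type. This sketch relies only on the entity_type and text attributes used in the earlier print loop; group_by_type is a hypothetical helper, and SimpleNamespace objects stand in for real entities so the snippet runs on its own.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_type(entities):
    """Group entity texts by entity_type, preserving encounter order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.entity_type].append(entity.text)
    return dict(grouped)


# Stand-in entities; with the library these would come from result.all_entities().
entities = [
    SimpleNamespace(entity_type="compound_name", text="Bedaquiline"),
    SimpleNamespace(entity_type="compound_name", text="TMC207"),
    SimpleNamespace(entity_type="disease", text="MDR-TB"),
]
grouped = group_by_type(entities)
print(grouped)
```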
Notebooks
Explore worked examples in the notebooks/ directory:
| Notebook | Description |
|---|---|
| 01_quickstart.ipynb | End-to-end extraction with cloud and local models, profiles, batch extraction |
| 02_fast_ner.ipynb | Fast dictionary-based NER — matching strategies, custom gazetteers, performance |
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
# clone and install dev dependencies
git clone https://github.com/structflo/structflo-ner.git
cd structflo-ner
pip install -e ".[dataframe]" --group dev
# run tests
pytest
# lint
ruff check .
ruff format .
Citation
If you use structflo.ner in your research, please cite:
BibTeX
@software{structflo_ner,
title = {structflo.ner: Zero-config NER for Drug Discovery},
url = {https://github.com/structflo/structflo-ner},
year = {2026}
}
License
This project is licensed under the Apache License 2.0.