Drug discovery NER wrapper around LangExtract — zero-config entity extraction for chemistry and biology.

structflo.ner

Zero-config Named Entity Recognition for drug discovery, chemistry, and biological sciences.

structflo.ner is a lightweight NER library specialized for pharmaceutical and biological sciences. It combines LangExtract with fuzzy-matching tools to deliver zero-configuration entity extraction.

It ships with two extraction engines:

	`NERExtractor`	`FastNERExtractor`
Approach	LLM-powered (Gemini, Ollama)	Dictionary-based (YAML gazetteers)
Speed	~10-60s per abstract	~0.4-1s per abstract
Novel entities	Discovers new entities	Known terms only
Context awareness	Full contextual understanding	String matching (exact + fuzzy)
Cost	API costs or local GPU	Free (no API calls)
Setup	API key or Ollama	Zero config
Output format	`NERResult`	`NERResult` (identical)

Installation

pip install structflo-ner

# or with uv
uv add structflo-ner

Install optional extras as needed:

pip install "structflo-ner[dataframe]"   # pandas DataFrame support
pip install "structflo-ner[fast]"         # fast dictionary-based NER (rapidfuzz)

LLM-Powered Extraction

Cloud model (Gemini)

The default model is gemini-2.5-flash. Pass your API key or set the GEMINI_API_KEY environment variable.

from structflo.ner import NERExtractor

extractor = NERExtractor(api_key="YOUR_GEMINI_KEY")

result = extractor.extract(
    "Gefitinib (ZD1839) is a first-generation EGFR tyrosine kinase inhibitor "
    "with IC50 = 0.033 µM, approved for non-small cell lung cancer (NSCLC). "
    "Its SMILES is COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1."
)
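
If you rely on the environment variable rather than passing the key, it can help to fail early when the variable is missing. The following is a minimal sketch that assumes only the GEMINI_API_KEY name mentioned above; `resolve_gemini_key` is a hypothetical helper and not part of structflo.ner.

```python
import os


def resolve_gemini_key(explicit_key=None, env=None):
    """Return an API key, preferring an explicit argument over the environment.

    Hypothetical helper for illustration; not part of structflo.ner.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError(
            "No Gemini API key found: pass one explicitly or set GEMINI_API_KEY."
        )
    return key


# Usage with the real library (not executed here):
# extractor = NERExtractor(api_key=resolve_gemini_key())
```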

Local models via Ollama

Run extraction entirely on your own hardware — no API key needed:

from structflo.ner import NERExtractor

# Point the extractor at a local Ollama server instead of Gemini
extractor = NERExtractor(
    model_id="qwen2.5:72b",
    model_url="http://localhost:11434",
)
text = (
    "Gefitinib (ZD1839) is a first-generation EGFR inhibitor with IC50 = 0.033 µM approved for NSCLC. "
    "Its SMILES is COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1."
)
result = extractor.extract(text)
result

Any model served by Ollama works: gemma, llama, mistral, qwen, deepseek, etc.

In Jupyter notebooks, results render as color-coded, interactive HTML.

To get a pandas DataFrame (install the dataframe extra for pandas support):

result.to_dataframe()


TB-specific extractor: pass profile=TB

from structflo.ner import NERExtractor, TB

extractor = NERExtractor(
    model_id="qwen2.5:72b",
    model_url="http://localhost:11434",
    profile=TB,
)

text = (
    "Bedaquiline (TMC207) is a diarylquinoline that inhibits the "
    "mycobacterial ATP synthase subunit c encoded by atpE (Rv1305). "
    "It shows potent activity against Mycobacterium tuberculosis "
    "including MDR-TB and XDR-TB. This compound was identified through "
    "whole-cell screening and targets the energy metabolism pathway."
)
result = extractor.extract(text)
result

# Flat list of all entities
for entity in result.all_entities():
    print(f"{entity.entity_type:20s} | {entity.text}")

compound_name        | Bedaquiline
compound_name        | TMC207
target               | ATP synthase subunit c
disease              | MDR-TB
disease              | XDR-TB
accession_number     | Rv1305
functional_category  | energy metabolism pathway
screening_method     | whole-cell screening

Batch extraction

Pass a list of texts to extract from multiple documents at once:

texts = [
    "Imatinib inhibits BCR-ABL with IC50 = 0.6 µM in CML.",
    "Trastuzumab targets HER2 in breast cancer patients.",
    "Remdesivir (GS-5734) is an antiviral with EC50 = 0.77 µM against SARS-CoV-2.",
]

results = extractor.extract(texts)

--- Text 1 ---
  compound_name        | Imatinib
  target               | BCR-ABL
  disease              | CML
  bioactivity          | IC50 = 0.6 µM

--- Text 2 ---
  compound_name        | Trastuzumab
  target               | HER2
  disease              | breast cancer

--- Text 3 ---
  compound_name        | Remdesivir
  compound_name        | GS-5734
  disease              | SARS-CoV-2
  bioactivity          | EC50 = 0.77 µM
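
A listing like the one above can be produced with a short loop over the returned results. The sketch below assumes that extract() returns one result per input text, in input order, as the output suggests. It uses small stand-in classes (`StubEntity`, `StubResult`, both hypothetical) so that it runs without a model; only all_entities(), entity_type and text are taken from the examples in this document.

```python
from dataclasses import dataclass, field


@dataclass
class StubEntity:
    """Stand-in for an extracted entity (hypothetical, for illustration)."""
    entity_type: str
    text: str


@dataclass
class StubResult:
    """Stand-in for NERResult that exposes only all_entities()."""
    entities: list = field(default_factory=list)

    def all_entities(self):
        return list(self.entities)


def format_batch(results):
    """Build the per-text listing as a list of lines."""
    lines = []
    for i, result in enumerate(results, start=1):
        lines.append(f"--- Text {i} ---")
        for entity in result.all_entities():
            lines.append(f"  {entity.entity_type:20s} | {entity.text}")
        lines.append("")  # blank line between texts
    return lines


# With the real library, `results = extractor.extract(texts)` would replace this stub data.
results = [
    StubResult([StubEntity("compound_name", "Imatinib"), StubEntity("target", "BCR-ABL")]),
    StubResult([StubEntity("compound_name", "Trastuzumab")]),
]
print("\n".join(format_batch(results)))
```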

Fast Dictionary-Based NER (Mode 2)

FastNERExtractor uses curated YAML gazetteers with a three-phase matching strategy for deterministic, high-speed extraction when LLMs are not available. It runs very fast, but because it matches text against predefined dictionary entries and patterns (exactly or fuzzily), it does not understand context.

from structflo.ner.fast import FastNERExtractor

fast = FastNERExtractor()

text = (
    "Bedaquiline (TMC207) is a diarylquinoline that inhibits the "
    "mycobacterial ATP synthase subunit c encoded by atpE (Rv1305). "
    "It shows potent activity against Mycobacterium tuberculosis "
    "including MDR-TB and XDR-TB. This compound was identified through "
    "whole-cell screening and targets the energy metabolism pathway."
)

result = fast.extract(text)
result

How matching works

Phase	Method	What it catches
1	Exact match	Case-sensitive and normalized dictionary lookups with word-boundary enforcement
1b	Regex patterns	Auto-derived patterns from accession number seeds (Rv tags, UniProt, PDB, etc.)
2	Fuzzy match	Typos and minor variants via rapidfuzz (configurable threshold)

# Fuzzy matching catches typos
result = fast.extract("Bedaquilne showed activity against TB")
# "Bedaquilne" -> canonical: "Bedaquiline" (method: fuzzy)

# Disable fuzzy matching for strict mode
strict = FastNERExtractor(fuzzy_threshold=0)
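
To make the phases in the table concrete, here is a simplified sketch of the same idea: exact lookup first, then a regex for accession numbers, then fuzzy matching of whatever is left. It is not the library's implementation. The gazetteer contents, the hand-written `RV_TAG` pattern and the `match_entities` function are illustrative assumptions, and the standard library's difflib stands in for rapidfuzz, so the threshold here is on a 0 to 1 scale.

```python
import re
from difflib import SequenceMatcher

# Tiny illustrative gazetteer; the real ones ship as YAML files.
GAZETTEER = {
    "compound_name": ["Bedaquiline", "Delamanid"],
    "disease": ["MDR-TB", "XDR-TB"],
}
# Phase 1b stand-in: a hand-written pattern for Rv locus tags such as Rv1305.
RV_TAG = re.compile(r"\bRv\d{4}[A-Za-z]?\b")


def match_entities(text, fuzzy_threshold=0.85):
    """Return (entity_type, matched_text, canonical, method) tuples."""
    hits = []
    claimed = set()  # character offsets already covered by an earlier phase

    # Phase 1: exact dictionary lookups with word-boundary enforcement.
    for entity_type, terms in GAZETTEER.items():
        for term in terms:
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            for m in re.finditer(pattern, text):
                hits.append((entity_type, m.group(), term, "exact"))
                claimed.update(range(m.start(), m.end()))

    # Phase 1b: regex patterns for accession numbers.
    for m in RV_TAG.finditer(text):
        hits.append(("accession_number", m.group(), m.group(), "regex"))
        claimed.update(range(m.start(), m.end()))

    # Phase 2: fuzzy-match the remaining tokens; a threshold of 0 disables it.
    if fuzzy_threshold > 0:
        for m in re.finditer(r"[A-Za-z][\w-]{4,}", text):
            if m.start() in claimed:
                continue
            token = m.group()
            best = None  # (score, entity_type, term) of the closest dictionary entry
            for entity_type, terms in GAZETTEER.items():
                for term in terms:
                    score = SequenceMatcher(None, token.lower(), term.lower()).ratio()
                    if best is None or score > best[0]:
                        best = (score, entity_type, term)
            if best is not None and best[0] >= fuzzy_threshold:
                hits.append((best[1], token, best[2], "fuzzy"))
    return hits


for hit in match_entities("Bedaquilne showed activity against MDR-TB (Rv1305)"):
    print(hit)
```

Running the cheaper, stricter phases first means the fuzzy pass only has to consider text that nothing else has claimed, which keeps false positives and runtime down.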

Built-in gazetteers

The fast extractor ships with curated gazetteers for TB drug discovery:

Gazetteer	Examples
`accession_number`	Rv1305, B586_RS00005
`gene_name`	atpE, InhA, DprE1
`screening_method`	whole-cell screening, fragment-based screening
`target`	InhA, DprE1, MmpL3
`compound_name`	Bedaquiline, Delamanid, Pretomanid
`functional_category`	DNA replication, cell wall biosynthesis
`strain`	M. tuberculosis H37Rv
`product`	enoyl-ACP reductase, ATP synthase subunit c
`disease`	TB, MDR-TB, XDR-TB

Custom gazetteers

Extend the built-in dictionaries with your own terms:

custom = FastNERExtractor(
    extra_gazetteers={
        "target": ["MyNovelTarget", "KinaseX"],
        "compound_name": ["CompoundABC"],
    }
)

Or drop a new YAML file into the gazetteers directory — the filename (without .yml) maps to an entity type.

Performance

Single abstract:  ~393 ms
8 abstracts:      ~862 ms

Profiles

Profiles control which entity types are extracted. Use them to focus the model on specific categories.

Built-in profiles

Profile	Entity classes
`FULL` (default)	compounds, targets, diseases, bioactivities, assays, mechanisms
`CHEMISTRY`	compound names, SMILES, CAS numbers, molecular formulas
`BIOLOGY`	targets, gene names, protein names
`BIOACTIVITY`	bioactivity measurements, assays
`DISEASE`	diseases and clinical indications
`TB`	TB drug discovery (compounds, targets, diseases, accessions, strains, screening methods, functional categories)

from structflo.ner import NERExtractor, CHEMISTRY

extractor = NERExtractor(api_key="YOUR_GEMINI_KEY")
result = extractor.extract(text, profile=CHEMISTRY)

Merging profiles

Combine multiple profiles for broader extraction:

from structflo.ner import CHEMISTRY, BIOLOGY

combined = CHEMISTRY.merge(BIOLOGY)
result = extractor.extract(text, profile=combined)
# Profile: chemistry+biology
# Entity classes: compound_name, smiles, cas_number, molecular_formula, target, gene_name, protein_name
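
The merged name and class list shown in the comments are consistent with an order-preserving union of the two profiles' entity classes. The sketch below illustrates that behavior with a stand-in class; `SketchProfile` is hypothetical and is not the library's EntityProfile, and the de-duplication rule is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchProfile:
    """Illustrative stand-in for a profile; not the library's EntityProfile."""
    name: str
    entity_classes: tuple

    def merge(self, other):
        # Union of entity classes, keeping first-seen order and dropping duplicates.
        combined = tuple(dict.fromkeys(self.entity_classes + other.entity_classes))
        return SketchProfile(name=f"{self.name}+{other.name}", entity_classes=combined)


chemistry = SketchProfile(
    "chemistry", ("compound_name", "smiles", "cas_number", "molecular_formula")
)
biology = SketchProfile("biology", ("target", "gene_name", "protein_name"))

combined = chemistry.merge(biology)
print(combined.name)
print(", ".join(combined.entity_classes))
```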

Custom profiles

Define your own extraction schema:

from structflo.ner import NERExtractor, EntityProfile

my_profile = EntityProfile(
    name="kinase_inhibitors",
    entity_classes=["compound_name", "smiles", "target", "bioactivity"],
    prompt="Extract kinase inhibitor names, SMILES, targets, and potency values.",
    examples=my_examples,
)
result = extractor.extract(text, profile=my_profile)

Working with Results

Both extractors return NERResult objects with the same interface:

# Typed entity lists
result.compounds        # [ChemicalEntity(...)]
result.targets          # [TargetEntity(...)]
result.diseases         # [DiseaseEntity(...)]
result.bioactivities    # [BioactivityEntity(...)]
result.assays           # [...]
result.mechanisms       # [...]
result.accessions       # [AccessionEntity(...)]

# Flat list of all entities
result.all_entities()

# Export to pandas DataFrame
df = result.to_dataframe()

# Serialize to dict (JSON-friendly)
result.to_dict()
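
The flat list is convenient for simple post-processing. As one example, the sketch below groups entities by type and drops repeated mentions. It uses a stand-in `Entity` tuple so that it runs on its own; only the entity_type and text attributes are taken from the examples in this document, and `group_by_type` is a hypothetical helper.

```python
from collections import defaultdict, namedtuple

# Stand-in for extracted entities; only entity_type and text come from this document.
Entity = namedtuple("Entity", ["entity_type", "text"])


def group_by_type(entities):
    """Map each entity type to the distinct texts found for it, in first-seen order."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity.text not in grouped[entity.entity_type]:
            grouped[entity.entity_type].append(entity.text)
    return dict(grouped)


# With the real library, pass result.all_entities() instead of this list.
entities = [
    Entity("compound_name", "Bedaquiline"),
    Entity("compound_name", "TMC207"),
    Entity("disease", "MDR-TB"),
    Entity("disease", "MDR-TB"),
]
print(group_by_type(entities))
```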

Notebooks

Explore worked examples in the notebooks/ directory:

Notebook	Description
01_quickstart.ipynb	End-to-end extraction with cloud and local models, profiles, batch extraction
02_fast_ner.ipynb	Fast dictionary-based NER — matching strategies, custom gazetteers, performance

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

# clone and install dev dependencies
git clone https://github.com/structflo/structflo-ner.git
cd structflo-ner
pip install -e ".[dataframe]" --group dev

# run tests
pytest

# lint
ruff check .
ruff format .

Citation

If you use structflo.ner in your research, please cite:

BibTeX

@software{structflo_ner,
  title  = {structflo.ner: Zero-config NER for Drug Discovery},
  url    = {https://github.com/structflo/structflo-ner},
  year   = {2026}
}

License

This project is licensed under the Apache License 2.0.

Release history

Version	Released
0.3.0 (this version)	Mar 2, 2026
0.2.3	Feb 16, 2026
0.2.2	Feb 15, 2026
0.2.1	Feb 15, 2026
0.2.0	Feb 15, 2026
0.1.1	Feb 15, 2026
0.1.0	Feb 14, 2026
