
Project description

🧩 StructSense

Welcome to structsense!

structsense is a powerful multi-agent system designed to extract structured information from unstructured data. By orchestrating intelligent agents, it helps you make sense of complex information — hence the name structsense.

Whether you're working with scientific texts, documents, or messy data, structsense enables you to transform it into meaningful, structured insights.

Caution: this package is still under development and may change rapidly over the next few weeks.


🏗️ Architecture

Below is the architecture of StructSense.

🚀 Features

  • 🔍 Multi-agent architecture for modular processing
  • 📑 Extraction of (structured) information from text, based on configuration
  • 🤝 Collaboration between agents
  • ⚙️ Easy to use
  • 🧠 Designed as a general-purpose, domain-agnostic framework

🧠 Example Use Cases

  • Entity and relation extraction from text
  • Knowledge graph construction

📁Examples

  • Using OpenRouter
    • You need an OpenRouter API key.
  • Using Ollama
    • Install Ollama and pull the models you intend to use. This example uses the deepseek-r1:14b model, which you can fetch by running `ollama pull deepseek-r1:14b`. To use a different model, e.g., llama3.2:latest, pull it in the same way. Make sure Ollama is running; you can start it with `ollama serve`.

📄 Requirements

📄 PDF Extraction Configuration

By default, the system uses the local Grobid service for PDF content extraction. If you have Grobid installed locally, no additional setup is required — everything is preconfigured for local usage.

Grobid Installation via Docker

docker pull lfoppiano/grobid:0.8.0
docker run --init -p 8070:8070 -e JAVA_OPTS="-XX:+UseZGC" lfoppiano/grobid:0.8.0

Setting JAVA_OPTS="-XX:+UseZGC" helps resolve a JVM error that can occur on macOS.


🔧 Using a Remote Grobid Server

If you're running Grobid on a remote server, set the following environment variable:

GROBID_SERVER_URL_OR_EXTERNAL_SERVICE=http://your-remote-grobid-server:PORT

🌐 Using an External PDF Extraction API

If you prefer to use an external PDF extraction API service, you must:

  • Set the API endpoint:

      GROBID_SERVER_URL_OR_EXTERNAL_SERVICE=https://api.SOMEAPIENDPOINT.com/api/extract
    
  • Enable the external API mode:

      EXTERNAL_PDF_EXTRACTION_SERVICE=True
    

Note: At the moment, the external API is assumed to be publicly accessible and does not require authentication (e.g., no JWT token or API key). Support for authenticated requests may be added in future versions.
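The two environment variables above can be resolved client-side before any extraction runs. A minimal Python sketch (the local-default fallback `http://localhost:8070` is an assumption mirroring the preconfigured local setup; Grobid's documented `GET /api/isalive` health endpoint is used for the liveness check):

```python
import os
import urllib.request

def grobid_base_url() -> str:
    """Resolve the PDF-extraction endpoint, falling back to a local
    Grobid instance (assumed default) when the variable is unset."""
    return os.environ.get(
        "GROBID_SERVER_URL_OR_EXTERNAL_SERVICE", "http://localhost:8070"
    )

def uses_external_service() -> bool:
    """True when EXTERNAL_PDF_EXTRACTION_SERVICE is set to a truthy value."""
    return os.environ.get("EXTERNAL_PDF_EXTRACTION_SERVICE", "False").lower() == "true"

def grobid_is_alive(timeout: float = 5.0) -> bool:
    """Ping Grobid's /api/isalive health endpoint; False on any network error."""
    try:
        with urllib.request.urlopen(grobid_base_url() + "/api/isalive", timeout=timeout) as r:
            return r.read().strip() == b"true"
    except OSError:
        return False
```

This is only an illustration of how the variables combine, not structsense's internal resolution logic.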


📄 Configuration

structsense supports flexible customization through both environment variables and a YAML configuration file.

The YAML config can be passed as a parameter (e.g., --agentconfig config/ner_agent.yaml), allowing you to define models, agents, and behaviors specific to your use case.

🔧 Environment Variables

You need to set the following environment variables (e.g., in a .env file). Weaviate is a vector database that we use to store the knowledge, which in our case is the ontologies/schemas.

  • The WEAVIATE-related environment variables are only necessary if you want to use the vector database as a knowledge source.

🧠 Core Keys

| Variable | Description | Default |
|---|---|---|
| ENABLE_KG_SOURCE | Enable access to the knowledge source, i.e., the vector database | false |
| WEAVIATE_API_KEY | Required. API key for Weaviate access | (none) |

🌐 Weaviate Configuration

This configuration is optional and only necessary if you plan to integrate a knowledge source (e.g., a vector store) into the pipeline.

| Variable | Description | Default |
|---|---|---|
| WEAVIATE_HTTP_HOST | HTTP host for Weaviate | localhost |
| WEAVIATE_HTTP_PORT | HTTP port for Weaviate | 8080 |
| WEAVIATE_HTTP_SECURE | Use HTTPS for the HTTP connection (true/false) | false |
| WEAVIATE_GRPC_HOST | gRPC host for Weaviate | localhost |
| WEAVIATE_GRPC_PORT | gRPC port for Weaviate | 50051 |
| WEAVIATE_GRPC_SECURE | Use secure gRPC (true/false) | false |

🧪 Weaviate Timeouts

| Variable | Description | Default |
|---|---|---|
| WEAVIATE_TIMEOUT_INIT | Timeout for initialization (in seconds) | 30 |
| WEAVIATE_TIMEOUT_QUERY | Timeout for query operations (in seconds) | 60 |
| WEAVIATE_TIMEOUT_INSERT | Timeout for data insertions (in seconds) | 120 |

🤖 Ollama Configuration for WEAVIATE

| Variable | Description | Default |
|---|---|---|
| OLLAMA_API_ENDPOINT | API endpoint for the Ollama model | http://host.docker.internal:11434 |
| OLLAMA_MODEL | Name of the Ollama embedding model | nomic-embed-text |

⚠️ Note: If Ollama is running on the host machine and the vector database (Weaviate) in Docker, use http://host.docker.internal:11434, which is also the default value. However, if both are running in Docker on the same host, use http://localhost:11434.
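Because all of the settings above are plain environment variables, ports and timeouts arrive as strings and need converting before use. A hedged reader sketch (the defaults mirror the tables above; the dict shape is illustrative, not structsense's internal API):

```python
import os

def weaviate_settings() -> dict:
    """Collect Weaviate and Ollama connection settings from the
    environment, applying the documented defaults (illustrative shape)."""
    env = os.environ.get
    return {
        "http_host": env("WEAVIATE_HTTP_HOST", "localhost"),
        "http_port": int(env("WEAVIATE_HTTP_PORT", "8080")),
        "http_secure": env("WEAVIATE_HTTP_SECURE", "false").lower() == "true",
        "grpc_host": env("WEAVIATE_GRPC_HOST", "localhost"),
        "grpc_port": int(env("WEAVIATE_GRPC_PORT", "50051")),
        "grpc_secure": env("WEAVIATE_GRPC_SECURE", "false").lower() == "true",
        "timeout_init": int(env("WEAVIATE_TIMEOUT_INIT", "30")),
        "timeout_query": int(env("WEAVIATE_TIMEOUT_QUERY", "60")),
        "timeout_insert": int(env("WEAVIATE_TIMEOUT_INSERT", "120")),
        "ollama_endpoint": env("OLLAMA_API_ENDPOINT", "http://host.docker.internal:11434"),
        "ollama_model": env("OLLAMA_MODEL", "nomic-embed-text"),
    }
```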

🧵 Optional: Experiment Tracking

| Variable | Description | Default |
|---|---|---|
| ENABLE_WEIGHTSANDBIAS | Enable Weights & Biases monitoring (true/false) | false |
| ENABLE_MLFLOW | Enable MLflow logging (true/false) | false |
| MLFLOW_TRACKING_URL | MLflow tracking server URL | http://localhost:5000 |

⚠️ Note: WEAVIATE_API_KEY is required for structsense to run; if it is not set, the system raises an error. For Weights & Biases, you need to create a project and provide its key.

# Example .env file

WEAVIATE_API_KEY=your_api_key
WEAVIATE_HTTP_HOST=localhost
WEAVIATE_HTTP_PORT=8080
WEAVIATE_HTTP_SECURE=false

WEAVIATE_GRPC_HOST=localhost
WEAVIATE_GRPC_PORT=50051
WEAVIATE_GRPC_SECURE=false

WEAVIATE_TIMEOUT_INIT=30
WEAVIATE_TIMEOUT_QUERY=60
WEAVIATE_TIMEOUT_INSERT=120

OLLAMA_API_ENDPOINT=http://host.docker.internal:11434
OLLAMA_MODEL=nomic-embed-text

ENABLE_WEIGHTSANDBIAS=true
ENABLE_MLFLOW=true
MLFLOW_TRACKING_URL=http://localhost:5000

🛠️ Minimum Required Environment Variables

Below are the minimum required environment variables to run structsense.
This configuration assumes all other optional variables will use their default values.

In this minimal setup:

  • 🚫 Weights & Biases is disabled
  • 🚫 MLflow tracking is disabled
  • 🚫 Knowledge source integration is disabled
  • 📦 As a result, no vector database (e.g., Weaviate) is used

ENABLE_WEIGHTSANDBIAS=false
ENABLE_MLFLOW=false
ENABLE_KG_SOURCE=false
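Values like those above arrive from the environment as strings, so "false", "False", and "0" should all disable a feature. A small hedged helper showing one way to interpret them (the exact parsing structsense uses may differ):

```python
def env_flag(value, default=False):
    """Interpret a string environment value as a boolean flag.
    Accepts common spellings; anything unrecognized falls back to default."""
    if value is None:
        return default
    v = value.strip().lower()
    if v in {"1", "true", "yes", "on"}:
        return True
    if v in {"0", "false", "no", "off"}:
        return False
    return default
```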

📄 YAML Configuration

To run structsense you need four YAML configuration files, plus an optional fifth for human-in-the-loop feedback.

  • The first is the agent configuration.

    • You can define as many agents as you want; they are processed dynamically.

      • Example agent configuration.
          extractor_agent:
            role: >
              Neuroscience Named Entity Recognition (NER) Extractor Agent
            goal: >
              Perform Named Entity Recognition (NER) on neuroscience {literature} and return structured JSON output.
            backstory: >
              You are an AI assistant specialized in processing neuroscience and who do not hallucinate. 
              Your expertise includes recognizing and categorizing named entities such as anatomical regions, experimental conditions, and cell types. 
              Your responses strictly adhere to JSON format, ensuring accurate and structured data extraction for downstream applications.
            llm:
              model: openrouter/openai/gpt-4o-2024-11-20
              base_url: https://openrouter.ai/api/v1
              frequency_penalty: 0.1
              temperature: 0.7
              seed: 53
              api_key: sk-or-v1-
          
          alignment_agent:
            role: >
              Neuroscience Named Entity Recognition (NER) Concept Alignment Agent
            goal: >
              Perform concept alignment to the extracted Named Entity Recognition (NER) by extractor_agent {extracted_structured_information} and return structured JSON output.
            backstory: >
              You are an AI assistant specialized in processing neuroscience concept alignment with structured models, i.e., ontologies or schemas and who do not hallucinate. 
              Your expertise includes recognizing and categorizing extracted named entities such as anatomical regions, experimental conditions, and cell types and aligning the recognized named entities such as cell types with corresponding ontological terms. 
              Your responses strictly adhere to JSON format, ensuring accurate and structured data extraction for downstream applications.
            llm:
              model: openrouter/openai/gpt-4o-2024-11-20
              base_url: https://openrouter.ai/api/v1
              frequency_penalty: 0.1
              temperature: 0.7
              seed: 53
              api_key: sk-or-v1-
          
          judge_agent:
            role: >
              Neuroscience Named Entity Recognition (NER) Judge Agent
            goal: >
              Evaluate the {aligned_structured_information} based on predefined criteria and generate a structured JSON output reflecting the assessment results.
            backstory: >
              You are an AI assistant with expert knowledge in neuroscience and structured models, i.e., ontologies or schemas, and someone who does not hallucinate.  
              Your task is to evaluate the {aligned_structured_information} based on the accuracy and quality of the alignment. 
              Assign the score between 0-1 with 1 being the highest score of your evaluation.
              Your responses strictly adhere to JSON format, ensuring accurate and structured data extraction for downstream applications.
            llm:
              model: openrouter/openai/gpt-4o-2024-11-20
              base_url: https://openrouter.ai/api/v1
              frequency_penalty: 0.1
              temperature: 0.7
              seed: 53
              api_key: sk-or-v1-
        
      • In the YAML file:
        • ID: Unique identifier
        • Goal: Task to be performed
        • LLM config: Model, base URL, temperature, etc.
        • Backstory: Background knowledge the agent leverages
        • Output variable: Result name for the next agent/task

      For further details, refer to Role-Goal-Backstory

  • The second is the task configuration.

    • The task configuration describes the tasks for the agents.
      • Example task configuration.
         extraction_task:
           description: >
             From the given literature extract named entities from neuroscience statements.
             A named entity is anything that can be referred to with a proper name.
             Some common named entities in neuroscience articles are animal species (e.g., mouse, drosophila, zebrafish),
             anatomical regions (e.g., neocortex, mushroom body, cerebellum), experimental conditions (e.g., control, tetrodotoxin treatment, Scn1a knockout),
             and cell types (e.g., pyramidal neuron, direction-sensitive mechanoreceptor, oligodendrocyte)
         
             Literature:
             {literature}
           expected_output: >
             output format: json
             Example output:
             {
               "extracted_terms": {
                 "1": [
                   {
                     "entity": "mouse",
                     "label": "ANIMAL_SPECIES",
                     "sentence": "These particles were visualized by fluorescent immunohistochemistry using mouse monoclonal anti-human myelin basic protein (MBPh) antibody (clone SMI-99).",
                     "start": 79,
                     "end": 84,
                     "paper_location": "methods",
                     "paper_title": "Concentration of myelin debris-like myelin basic protein-immunoreactive particles in the distal (anterior)-most part of the myelinated region in the normal rat optic nerve",
                     "doi": "10.1101/2025.03.19.643597"
                   }
                 ]
               }
             }
           agent_id: extractor_agent
         
         alignment_task:
           description: >
             Take the output of extractor_agent {extracted_structured_information} as input and perform the concept alignment using the ontological concepts.
             A concept alignment is where you align the given entity to the matching concept or class from an ontology or schema.
           expected_output: >
             output format: json
             Example output:
             {
               "aligned_ner_terms": {
                 "1": [
                   {
                     "entity": "oligodendrocyte",
                     "label": "CELL_TYPE",
                     "ontology_id": "CL:0000128",
                     "ontology_label": "Oligodendrocyte",
                     "sentence": "Individual oligodendrocytes provide...",
                     "start": 14,
                     "end": 29,
                     "paper_location": "discussion",
                     "paper_title": "Concentration of myelin debris-like...",
                     "doi": "10.1101/2025.03.19.643597"
                   }
                 ] 
               }
             }
           agent_id: alignment_agent
         
         judge_task:
           description: >
             Take the output of alignment agent {aligned_structured_information} as input and perform the following evaluation:
             1. Assess the quality and accuracy of the alignment with the ontology or schema.
             2. Assign a score between 0 and 1 as a judge_score.
             3. Update the {aligned_structured_information} by adding the judge_score.
           expected_output: >
             output format: json
             Example output:
             {
               "judge_ner_terms": {
                 "1": [
                   {
                     "entity": "oligodendrocyte",
                     "label": "CELL_TYPE",
                     "ontology_id": "CL:0000128",
                     "ontology_label": "Oligodendrocyte",
                     "judge_score": "0.8",
                     "sentence": "Individual oligodendrocytes provide...",
                     "start": 14,
                     "end": 29,
                     "paper_location": "discussion",
                     "paper_title": "Concentration of myelin debris-like...",
                     "doi": "10.1101/2025.03.19.643597"
                   }
                 ]
               }
             }
           agent_id: judge_agent
        
      • Each task links to a specific agent via agent_id and defines:
        • Description: What the task does
        • Input/Output schema: Example output structure in JSON
        • Agent Link: Tied to an id from agents configuration

        ⚠️ Note: The {variable_name} placeholders are replaced at run time. Also pay attention to the tasks that use the output variable defined in the agent configuration.

  • The third is the embedding configuration. For more about embedding configurations with different providers, see https://docs.crewai.com/concepts/memory#additional-embedding-providers.

    embedder_config:
      provider: ollama
      config:
        api_base: http://localhost:11434
        model: nomic-embed-text:latest
    
  • The fourth is the search configuration, where we define the search keys. Since ontologies/schemas are our current knowledge source, the search keys are entity and label. This file is optional if you do not use a knowledge source.

    search_key: #local vector database
      - entity
      - label
    
  • The fifth is the human-in-the-loop configuration (optional).

      extractor_agent: false
      alignment_agent: false
      judge_agent: false
      humanfeedback_agent: true
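The {variable_name} placeholders used throughout the agent and task files behave like ordinary template slots: each task description is rendered with the current inputs, and an agent's output variable (e.g., extracted_structured_information) feeds the next task. A simplified illustration of that runtime replacement (not structsense's actual templating code):

```python
# A task description as it appears in the YAML, with one placeholder.
description = (
    "Take the output of extractor_agent {extracted_structured_information} "
    "as input and perform the concept alignment using the ontological concepts."
)

# At run time the framework fills the slot with the upstream agent's output.
rendered = description.format(
    extracted_structured_information='{"extracted_terms": {"1": []}}'
)
```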
    

📦 Installation

Install this package via:

pip install structsense

Or get the newest development version via:

pip install git+https://github.com/sensein/ner_framework.git

🧪 CLI Usage

You can run StructSense using the CLI tool structsense-cli. Below are a few examples showing different ways to provide input.


📄 1. Extract from a PDF file (with knowledge source)

structsense-cli extract \
  --agentconfig config/ner_agent.yaml \
  --taskconfig config/ner_task.yaml \
  --embedderconfig config/embedding.yaml \
  --knowledgeconfig config/search_ontology_knowledge.yaml \
  --enable_human_feedback true \
  --agent_feedback_config config/human_in_loop.yaml \
  --source somefile.pdf 

💬 2. Extract from raw text (with knowledge source)

structsense-cli extract \
  --agentconfig config/ner_agent.yaml \
  --taskconfig config/ner_task.yaml \
  --embedderconfig config/embedding.yaml \
  --knowledgeconfig config/search_ontology_knowledge.yaml \
  --enable_human_feedback true \
  --agent_feedback_config config/human_in_loop.yaml \
  --source "Additionally, mutations in the APOE gene have been linked to neurodegenerative disorders, impacting astrocytes and microglia function."

⚡ 3. Extract from raw text without human loop

structsense-cli extract \
  --agentconfig config/ner_agent.yaml \
  --taskconfig config/ner_task.yaml \
  --embedderconfig config/embedding.yaml \
  --knowledgeconfig config/search_ontology_knowledge.yaml \
  --enable_human_feedback false \
  --source "Additionally, mutations in the APOE gene have been linked to neurodegenerative disorders, impacting astrocytes and microglia function."
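Whichever input mode you use, each task pins its expected_output to a JSON schema, so downstream code can validate results before consuming them. A hedged consumer-side sketch for the judge output shown earlier (field names follow the example; the sample delivers judge_score as a string, so it is coerced):

```python
import json

def valid_judge_output(payload: str) -> bool:
    """Check that judge output is JSON containing judge_ner_terms groups
    whose judge_score parses to a number in [0, 1]."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    terms = data.get("judge_ner_terms")
    if not isinstance(terms, dict):
        return False
    for group in terms.values():
        for item in group:
            try:
                score = float(item.get("judge_score", -1))
            except (TypeError, ValueError):
                return False
            if not 0.0 <= score <= 1.0:
                return False
    return True
```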

In progress

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

structsense-0.0.1.tar.gz (46.9 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

structsense-0.0.1-py3-none-any.whl (47.3 kB)

Uploaded Python 3

File details

Details for the file structsense-0.0.1.tar.gz.

File metadata

  • Download URL: structsense-0.0.1.tar.gz
  • Upload date:
  • Size: 46.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.11.7 Darwin/23.6.0

File hashes

Hashes for structsense-0.0.1.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5506922d7208f694ac6030d463d94419579e6dbb3965667abae0ec4674e0d691 |
| MD5 | 2dbe0a1e700ed365a029725433a20b8e |
| BLAKE2b-256 | 0ac39b3191e1942367468e4f2be1999f75f14cde83b7196c2aabd330c03ca897 |

See more details on using hashes here.

File details

Details for the file structsense-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: structsense-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 47.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.11.7 Darwin/23.6.0

File hashes

Hashes for structsense-0.0.1-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | c5f6147ebc6bdcc8083a8d089ce49af6049e5139316d2600c7be8614efdbf04c |
| MD5 | deeefa9ce4ffae3356eb61b9654010f0 |
| BLAKE2b-256 | ca97e18711e2f5cfa871567a84cd4371371d281ea7ea170b6b9410a517d198a7 |

See more details on using hashes here.
