UniProt Reader for LlamaIndex

This package provides a reader for UniProt Swiss-Prot format files, allowing you to load protein data into LlamaIndex for further processing and analysis.

Features

  • Efficient parsing of large UniProt files with optional lazy loading.
  • Structured output: each document's text contains the entire UniProt record, and its metadata contains the protein ID.
  • Configurable field selection.

Installation

pip install llama-index-readers-uniprot

Usage

from llama_index.readers.uniprot import UniProtReader

# Initialize the reader
reader = UniProtReader()

# Load data from a UniProt file
documents = reader.load_data("path/to/uniprot_sprot.dat")

# Access the documents
for doc in documents:
    print(f"Protein ID: {doc.metadata['id']}")

Lazy Loading for Large Files

Because UniProt files are large (several GB), lazy loading is recommended. It processes records one at a time instead of loading the entire database into memory:

# Initialize the reader
reader = UniProtReader()

# Load data lazily from a UniProt file
for doc in reader.lazy_load_data("path/to/uniprot_sprot.dat"):
    print(f"Protein ID: {doc.metadata['id']}")
    print("---")
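To illustrate why record-at-a-time processing keeps memory use flat, here is a minimal standard-library sketch of lazy iteration over a Swiss-Prot flat file. It relies only on the format convention that each entry starts with an `ID` line and ends with a line containing `//`. The helpers `iter_swissprot_records` and `entry_name` and the two sample entries are hypothetical, written for illustration; they are not part of this package, and the package's own parser may work differently.

```python
import io
from typing import Iterator, TextIO


def iter_swissprot_records(handle: TextIO) -> Iterator[str]:
    """Yield one raw entry at a time (hypothetical helper).

    Assumes every entry is terminated by a line holding only '//'.
    Only the lines of the current entry are held in memory.
    """
    lines = []
    for line in handle:
        if line.rstrip() == "//":
            if lines:
                yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    # Any trailing lines without a '//' terminator are ignored.


def entry_name(record: str) -> str:
    """Return the second whitespace-separated token of the 'ID' line."""
    for line in record.splitlines():
        if line.startswith("ID "):
            return line.split()[1]
    raise ValueError("record has no ID line")


# Two made-up, heavily truncated entries used purely as sample input.
sample = io.StringIO(
    "ID   DEMO1_HUMAN             Reviewed;          10 AA.\n"
    "AC   X00001;\n"
    "//\n"
    "ID   DEMO2_MOUSE             Reviewed;          12 AA.\n"
    "AC   X00002;\n"
    "//\n"
)

names = [entry_name(rec) for rec in iter_swissprot_records(sample)]
print(names)
```

With a real file, the same generator could be driven by `open("path/to/uniprot_sprot.dat")` in place of the in-memory sample.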

Building an Index from a Lazily Loaded UniProt File

from llama_index.readers.uniprot import UniProtReader
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

reader = UniProtReader(max_records=10000)

# Create the index first so it can be checked for already-indexed proteins
text_splitter = SentenceSplitter(chunk_size=2048)
index = VectorStoreIndex([], transformations=[text_splitter], show_progress=True)

# Collect protein IDs already in the index (empty for a freshly created index)
existing_protein_ids = {
    node.metadata.get('id')
    for node in index.storage_context.docstore.docs.values()
    if node.metadata.get('id')
}

documents_gen = reader.lazy_load_data("path/to/uniprot_sprot.dat")

# Process documents in batches
batch_size = 10
current_batch = []

for doc in documents_gen:
    protein_id = doc.metadata.get('id')

    if protein_id in existing_protein_ids:
        print(f"Skipping document {protein_id} - already indexed")
        continue

    current_batch.append(doc)

    if len(current_batch) >= batch_size:
        index.refresh_ref_docs(documents=current_batch)
        current_batch = []

# Process any remaining documents
if current_batch:
    index.refresh_ref_docs(documents=current_batch)

# Persist the index to disk
persist_dir = "path/to/persist/directory"
index.storage_context.persist(persist_dir=persist_dir)
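The manual batch bookkeeping in the loop above can be factored into a small generator. The sketch below uses only `itertools.islice`. `batched` is a hypothetical helper name (Python 3.12 ships a similar `itertools.batched` that yields tuples), and the integers stand in for documents.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most batch_size items, so the source stays lazy.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(batched(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

In the indexing loop, each batch would be passed to `index.refresh_ref_docs(documents=batch)`, which removes the need for a separate "remaining documents" step because the final, shorter batch is yielded like any other.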

Customizing Field Selection

You can specify which fields to include in the output:

# Only include specific fields
reader = UniProtReader(include_fields={"id", "description", "sequence_length"})
documents = reader.load_data("path/to/uniprot_sprot.dat")

Available fields:

  • id: Protein identifier
  • accession: Accession numbers
  • description: Protein description
  • gene_names: Gene names
  • organism: Organism name
  • comments: Comments and annotations
  • keywords: Keywords
  • sequence_length: Length of the protein sequence
  • sequence_mw: Molecular weight of the protein
  • taxonomy: Taxonomic classification
  • taxonomy_id: Taxonomic database identifiers
  • citations: Literature citations
  • cross_references: Cross-references to other databases
  • features: Protein features

By default, all fields are included.
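As a rough mental model of field selection, the sketch below filters a parsed record (represented here as a plain dict) down to a requested set of field names and rejects names that are not in the list above. `AVAILABLE_FIELDS`, `select_fields`, and the sample record are hypothetical and only illustrate the idea. How the package itself stores records or handles unknown field names is an assumption not confirmed by this README.

```python
from typing import Any, Dict, Optional, Set

# Field names copied from the "Available fields" list above.
AVAILABLE_FIELDS = frozenset({
    "id", "accession", "description", "gene_names", "organism",
    "comments", "keywords", "sequence_length", "sequence_mw",
    "taxonomy", "taxonomy_id", "citations", "cross_references", "features",
})


def select_fields(
    record: Dict[str, Any], include_fields: Optional[Set[str]] = None
) -> Dict[str, Any]:
    """Return only the requested fields; None means keep everything."""
    if include_fields is None:
        return dict(record)
    unknown = set(include_fields) - AVAILABLE_FIELDS
    if unknown:
        raise ValueError(f"Unknown fields: {sorted(unknown)}")
    return {key: value for key, value in record.items() if key in include_fields}


# Made-up record used purely as sample input.
record = {
    "id": "DEMO1_HUMAN",
    "organism": "Homo sapiens",
    "keywords": ["Example"],
}

selected = select_fields(record, {"id", "organism"})
print(selected)
```

Validating names up front, as this sketch does, surfaces a typo in a field name immediately instead of silently producing documents with missing content.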

Limiting Number of Records

You can limit the number of records to parse using the max_records parameter:

# Parse only first 1000 records
reader = UniProtReader(max_records=1000)
documents = reader.load_data("path/to/uniprot_sprot.dat")

# Works with lazy loading too
for doc in reader.lazy_load_data(
    "path/to/uniprot_sprot.dat", max_records=1000
):
    print(f"Protein ID: {doc.metadata['id']}")
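If you want a cap that is independent of the reader's configuration, the standard library's `itertools.islice` gives the same "first N" behavior for any generator. The sketch below uses a stand-in generator, `fake_records`, which is hypothetical; with the real reader you would wrap the generator returned by `reader.lazy_load_data(...)` instead.

```python
from itertools import islice
from typing import Iterator


def fake_records() -> Iterator[str]:
    """Endless stand-in for a lazy record source (hypothetical)."""
    n = 0
    while True:
        n += 1
        yield f"record-{n}"


# islice stops pulling from the source after three items,
# so even an endless generator terminates here.
first_three = list(islice(fake_records(), 3))
print(first_three)  # ['record-1', 'record-2', 'record-3']
```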

Contributing

We welcome contributions! Please see our contributing guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.
