
A Python library for Crystal Parquet Database.

Project description

Crystal-Parquet-Database

Crystal Parquet Database (crystpqdb) is a Python library for building a unified local database of crystal structures by downloading datasets from multiple sources (Alexandria, Materials Project, Materials Cloud, and JARVIS) into a consistent on-disk layout.

Installation

1. PyPI

pip install crystpqdb

2. Manually

To install and use this package manually, we use the conda package manager for conda packages and Pixi to handle package dependencies and virtual environments.

1. Install Miniforge

Miniforge is the community-driven (conda-forge) minimalistic conda installer; subsequent package installations come from the conda-forge channel.

By contrast, Miniconda is the minimalistic conda installer driven by Anaconda (the company); subsequent package installations come from the Anaconda channels (defaults or otherwise).

Download here

2. Install Pixi package manager

Linux/macOS

wget -qO- https://pixi.sh/install.sh | sh

Windows (PowerShell)

powershell -ExecutionPolicy ByPass -c "irm -useb https://pixi.sh/install.ps1 | iex"

3. Clone the repository

git clone https://github.com/YKK-xTechLab-Engineering/YKK-Point-Cloud.git

4. Install dependencies and virtual environments through Pixi

pixi install

Quickstart

All downloads are created via a small factory and a per-source DownloadConfig.

1. Download the combined database

from pathlib import Path
from crystpqdb import download

data_root = Path("./data")
db_dir = data_root / "crystpqdb"
db_dir = download(db_dir)
print("Downloaded to: {}".format(db_dir))
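A download helper like this is typically expected to be idempotent: skip the work when the target directory is already populated. The following is a rough stdlib sketch of that skip-if-present pattern; `ensure_downloaded` and its `fetch` callback are illustrative names, not part of the crystpqdb API:

```python
from pathlib import Path


def ensure_downloaded(db_dir, fetch):
    """Download into db_dir only if it is missing or empty.

    `fetch` stands in for whatever callable actually retrieves the files.
    This is an illustration of the skip-if-present pattern, not
    crystpqdb's actual implementation.
    """
    db_dir = Path(db_dir)
    if db_dir.is_dir() and any(db_dir.iterdir()):
        return db_dir  # already populated: nothing to do
    db_dir.mkdir(parents=True, exist_ok=True)
    fetch(db_dir)
    return db_dir
```

Calling the helper twice triggers the fetch only once; the second call returns the already-populated directory.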

2. Use the loaders to download datasets from different sources

This package defines a common BaseLoader interface for downloading datasets and transforming them into a unified schema.

A factory (LoaderFactory, or the get_loader convenience function) returns the correct loader for a given source and dataset, keyed by the source_database and source_dataset names. If you do not know these names, you can use LoaderFactory to list all available sources and datasets; otherwise, passing an unknown pair raises an error that lists the available source databases and datasets.

import os
from pathlib import Path

from crystpqdb.loaders import get_loader, LoaderConfig

data_root = Path("./data")

# Define Configurations for the loader
config = LoaderConfig(
    api_key=os.getenv("MP_API_KEY"),
    download_from_scratch=False,
    ingest_from_scratch=True,
    transform_from_scratch=True
    )

# Get the loader
loader = get_loader("mp", "summary", data_dir=data_root, config=config)

# Run the loader
table = loader.run()
print(table.shape)
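The factory described above can be pictured as a registry keyed by (source_database, source_dataset) pairs. The following plain-Python sketch mirrors that lookup-or-raise behavior, including the ValueError that lists the available options; the registry contents and the function name are illustrative, not crystpqdb internals:

```python
# Illustrative registry mapping (source_database, source_dataset) pairs to
# loader names; the real crystpqdb registry would map to loader classes.
_REGISTRY = {
    ("alex", "3d"): "Alexandria3DLoader",
    ("mp", "summary"): "MPLoader",
}


def lookup_loader(source_database, source_dataset):
    """Return the registered loader for a pair, or raise a ValueError
    that lists every available (source_database, source_dataset) pair."""
    key = (source_database, source_dataset)
    if key in _REGISTRY:
        return _REGISTRY[key]
    available = ", ".join(f"({db!r}, {ds!r})" for db, ds in sorted(_REGISTRY))
    raise ValueError(f"Unknown pair {key!r}. Available: {available}")
```

Keeping the registry in one place means adding a new source only requires registering its loader class under a new key.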

3. Load all datasets into a single ParquetDB

from pathlib import Path

from parquetdb import ParquetDB

from crystpqdb.loaders import get_loader

data_root = Path("./data")
# Open (or create) the destination ParquetDB; the db_path value here is illustrative.
pqdb = ParquetDB(db_path=data_root / "combined")

datasets = [
    ("alex", "3d"),
    ("alex", "2d"),
    ("alex", "1d"),
    ("mp", "summary"),
    ("materialscloud", "mc3d"),
]

for source_database, source_dataset in datasets:
    loader = get_loader(source_database, source_dataset, data_dir=data_root)
    table = loader.run()
    pqdb.create(table, convert_to_fixed_shape=False)

table = pqdb.read(columns=["id"])
print(table.shape)

Note: This requires a lot of memory (~64 GB RAM) to load all the datasets into a single ParquetDB. Batch support is not yet implemented.
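Until batch support lands, one workaround is to process sources in fixed-size groups rather than all at once. A generic stdlib batching helper (not a crystpqdb API) might look like:

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items.

    A generic sketch: iterating the (source, dataset) pairs in small
    groups keeps only one group's worth of work in flight at a time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(iterable)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk
```

For example, `list(batched(datasets, 2))` splits the five pairs above into groups of at most two, so each group can be loaded and written before the next begins.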

Current Loaders

Loader Class         (source_database, source_dataset)    Working?
Alexandria1DLoader   ("alex", "1d")                        Yes
Alexandria2DLoader   ("alex", "2d")                        Yes
Alexandria3DLoader   ("alex", "3d")                        Yes
MPLoader             ("mp", "summary")                     Yes
MC3DLoader           ("materialscloud", "mc3d")            Yes
JarvisLoader

All listed loaders are currently implemented and functional. If you attempt to use a (source_database, source_dataset) pair not in this table, a ValueError will be raised and the available options will be listed.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

crystpqdb-0.0.1.dev31.tar.gz (302.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

crystpqdb-0.0.1.dev31-py3-none-any.whl (21.9 kB view details)

Uploaded Python 3

File details

Details for the file crystpqdb-0.0.1.dev31.tar.gz.

File metadata

  • Download URL: crystpqdb-0.0.1.dev31.tar.gz
  • Upload date:
  • Size: 302.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for crystpqdb-0.0.1.dev31.tar.gz

  • SHA256: 80185404b7f2e049615fa70ef10af1bf610022538ecc8eb1b06cb7e821aa9cb3
  • MD5: bdef442bf41109fb2eddefb40511a605
  • BLAKE2b-256: 664c205f695e089498f6fe6a7ad761983bbcbf067203d21c8658b60a41f365e7

See more details on using hashes here.

File details

Details for the file crystpqdb-0.0.1.dev31-py3-none-any.whl.

File metadata

File hashes

Hashes for crystpqdb-0.0.1.dev31-py3-none-any.whl

  • SHA256: 0d1f9b0b6c1e5d73a090b281145d83b0a9c3c0512e0a957d98f9eb09e3f67998
  • MD5: da76f5de3e3e11dcae40f04bdcd3c27c
  • BLAKE2b-256: 86caf64b20879552ae162880bcc04f048ac39f6c8040a4a2f59140b2bdf3d9aa

See more details on using hashes here.
