A Python library for Crystal Parquet Database.
Crystal-Parquet-Database
Crystal Parquet Database (crystpqdb) is a Python library for building a unified local database of crystal structures. It downloads datasets from multiple sources (Alexandria, Materials Project, Materials Cloud, and JARVIS) into a consistent on-disk layout.
Installation
1. PyPI
pip install crystpqdb
2. Manually
To install and use this package manually, we use the conda package manager for conda packages and Pixi to handle package dependencies and virtual environments.
1. Install Miniforge
Miniforge is the community-driven (conda-forge) minimal conda installer, so subsequent package installations come from the conda-forge channel.
By contrast, Miniconda is the minimal conda installer maintained by Anaconda (the company), and subsequent package installations come from the anaconda channels (default or otherwise).
2. Install Pixi package manager
Linux/macOS
wget -qO- https://pixi.sh/install.sh | sh
Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm -useb https://pixi.sh/install.ps1 | iex"
3. Clone the repo
git clone https://github.com/YKK-xTechLab-Engineering/YKK-Point-Cloud.git
4. Install dependencies and virtual environments through Pixi
pixi install
Quickstart
All downloads are created via a small factory and a per-source configuration object (named LoaderConfig in the loader examples in this section).
1. Download the combined database
from pathlib import Path
from crystpqdb import download
data_root = Path("./data")
db_dir = data_root / "crystpqdb"
db_dir = download(db_dir)
print("Downloaded to: {}".format(db_dir))
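Once the download finishes, it can help to check what actually landed on disk. The following is a minimal sketch using only the standard library. The `summarize_dir` helper is hypothetical (it is not part of crystpqdb), and it makes no assumption about the file names inside the database directory.

```python
from pathlib import Path


def summarize_dir(root):
    """Return (relative_path, size_in_bytes) for every file under root, sorted by path."""
    root = Path(root)
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # Same directory as in the download example; adjust if you chose another path.
    for rel_path, size in summarize_dir(Path("./data") / "crystpqdb"):
        print(f"{size:>12,d}  {rel_path}")
```

The function only reads file metadata, so it is safe to run on a large database directory.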
2. Use the loaders to download datasets from different sources
This package defines a common BaseLoader interface to download datasets and transform them into a unified schema.
A factory, available as LoaderFactory or get_loader, returns the correct loader for a given source and dataset. The loader is selected by the names of the source_database and source_dataset. If you do not know these names, you can use LoaderFactory to list all available sources and datasets. Alternatively, passing an unknown name raises an error that lists the available source databases and datasets.
import os
from pathlib import Path

from crystpqdb.loaders import get_loader, LoaderConfig

# Root directory for downloaded data (same as in the previous example)
data_root = Path("./data")

# Define the configuration for the loader
config = LoaderConfig(
    api_key=os.getenv("MP_API_KEY"),  # API key read from the environment
    download_from_scratch=False,
    ingest_from_scratch=True,
    transform_from_scratch=True,
)

# Get the loader
loader = get_loader("mp", "summary", data_dir=data_root, config=config)

# Run the loader
table = loader.run()
print(table.shape)
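To illustrate the factory idea described above, here is a self-contained sketch of a registry keyed by (source_database, source_dataset). It is not crystpqdb's actual implementation: `register`, `DemoLoader`, and `get_demo_loader` are hypothetical names, and the "demo" source exists only for this example. The sketch shows one way an unknown pair can produce a ValueError that lists the available options.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry mapping (source_database, source_dataset) to a loader class.
_REGISTRY: Dict[Tuple[str, str], type] = {}


def register(source_database: str, source_dataset: str) -> Callable[[type], type]:
    """Class decorator that records a loader under its (database, dataset) key."""
    def decorator(cls: type) -> type:
        _REGISTRY[(source_database, source_dataset)] = cls
        return cls
    return decorator


@register("demo", "small")
class DemoLoader:
    """Stand-in loader; a real loader would download and transform data."""

    def run(self):
        return [{"id": "demo-1"}]


def get_demo_loader(source_database: str, source_dataset: str):
    """Look up a loader, listing the valid keys when the lookup fails."""
    key = (source_database, source_dataset)
    if key not in _REGISTRY:
        available = ", ".join(str(k) for k in sorted(_REGISTRY))
        raise ValueError(f"Unknown loader {key}. Available: {available}")
    return _REGISTRY[key]()


if __name__ == "__main__":
    print(get_demo_loader("demo", "small").run())
    try:
        get_demo_loader("demo", "missing")
    except ValueError as exc:
        print(exc)
```

Keeping the lookup in one function means every caller gets the same error message, which is convenient when the set of supported datasets changes between releases.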
3. Load all datasets into a single ParquetDB
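from pathlib import Path

from parquetdb import ParquetDB
from crystpqdb.loaders import get_loader

# Root directory for downloaded data
data_dir = Path("./data")

# Assumption: the combined database is stored under data_dir; adjust the path as needed
pqdb = ParquetDB(db_path=data_dir / "combined")

datasets = [
    ("alex", "3d"),
    ("alex", "2d"),
    ("alex", "1d"),
    ("mp", "summary"),
    ("materialscloud", "mc3d"),
]

# Run each loader and append its table to the combined database
for source_database, source_dataset in datasets:
    loader = get_loader(source_database, source_dataset, data_dir=data_dir)
    table = loader.run()
    pqdb.create(table, convert_to_fixed_shape=False)

# Read back only the "id" column to check the total number of rows
table = pqdb.read(columns=["id"])
print(table.shape)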
Note: Loading all the datasets into a single ParquetDB requires a lot of memory (~64 GB RAM). Batch support is not yet implemented.
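Until batch support is available, one possible workaround is to split records into fixed-size chunks before writing them. The sketch below is generic and uses only the standard library. `iter_batches` is a hypothetical helper, not part of crystpqdb or ParquetDB, and whether chunked writes actually lower the peak memory of a given loader is an assumption that has not been verified here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items from records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    rows = ({"id": i} for i in range(10))
    for batch in iter_batches(rows, batch_size=4):
        # In a real pipeline, each batch could be written to the database here.
        print(len(batch))
```

Because the helper consumes its input lazily, only one batch is held in memory at a time as long as the source of records is itself an iterator.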
Current Loaders
| Loader Class | (source_database, source_dataset) | Working? |
|---|---|---|
| Alexandria1DLoader | ("alex", "1d") | ✅ |
| Alexandria2DLoader | ("alex", "2d") | ✅ |
| Alexandria3DLoader | ("alex", "3d") | ✅ |
| MPLoader | ("mp", "summary") | ✅ |
| MC3DLoader | ("materialscloud", "mc3d") | ✅ |
| JarvisLoader | | ❌ |
Loaders marked ✅ are currently implemented and functional; JarvisLoader is not yet working. If you attempt to use a (source_database, source_dataset) pair not in this table, a ValueError will be raised and the available options will be listed.