
Package to encode and decode crystal structures into text representations

Project description

xtal2txt


Package to define, convert, encode and decode crystal structures into text representations. xtal2txt is an important part of our MatText framework.

💪 Getting Started

🚀 Installation

The most recent release can be installed from PyPI with:

$ pip install xtal2txt

The most recent code and data can be installed directly from GitHub with:

$ pip install git+https://github.com/lamalab-org/xtal2txt.git

Text Representation with xtal2txt

The TextRep class in xtal2txt.core facilitates the transformation of crystal structures into different text representations. Below is an example of its usage:

from xtal2txt.core import TextRep
from pymatgen.core import Structure


# Load structure from a CIF file
from_file = "InCuS2_p1.cif"
structure = Structure.from_file(from_file)  # file format is inferred from the .cif extension

# Initialize TextRep Class
text_rep = TextRep.from_input(structure)

requested_reps = [
    "cif_p1",
    "slices",
    "atom_sequences",
    "atom_sequences_plusplus",
    "crystal_text_llm",
    "zmatrix",
]

# Get the requested text representations
requested_text_reps = text_rep.get_requested_text_reps(requested_reps)
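
To use the output, you can iterate over it as in the sketch below, which assumes get_requested_text_reps returns a dictionary mapping each requested representation name to its text encoding (verify against the API docs):

# Sketch: assumes requested_text_reps is a dict of {representation_name: text}.
for name, text in requested_text_reps.items():
    print(f"--- {name} ---")
    print(text)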

Using xtal2txt Tokenizers

By default, the tokenizer is initialized with [CLS] and [SEP] tokens. For an example, see the SliceTokenizer usage:

from xtal2txt.tokenizer import SliceTokenizer

tokenizer = SliceTokenizer(
                model_max_length=512, 
                truncation=True, 
                padding="max_length", 
                max_length=512
            )
print(tokenizer.cls_token) # returns [CLS]

You can access the [CLS] token using the cls_token attribute of the tokenizer. During decoding, you can use the skip_special_tokens parameter to skip these special tokens.

Decoding while skipping special tokens:

tokenizer.decode(token_ids, skip_special_tokens=True)
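
Putting the pieces together, a full encode/decode round trip might look like the following sketch; encode and decode are inherited from the Hugging Face PreTrainedTokenizer interface, and the SLICES-like input string is purely illustrative:

# Illustrative input string; real SLICES strings come from TextRep (e.g. "slices" above).
slices_string = "Ga Ga N N 0 2 - - o 0 3 - o -"

# Encode to integer token IDs, then decode back, dropping [CLS]/[SEP]/padding tokens.
token_ids = tokenizer.encode(slices_string)
print(tokenizer.decode(token_ids, skip_special_tokens=True))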

Initializing tokenizers with custom special tokens

In scenarios where the [CLS] token is not required, you can initialize the tokenizer with an empty special_tokens dictionary.

Initialization without [CLS] and [SEP] tokens:

tokenizer = SliceTokenizer(
                model_max_length=512, 
                special_tokens={}, 
                truncation=True,
                padding="max_length", 
                max_length=512
            )

All Xtal2txtTokenizer instances inherit from the Hugging Face PreTrainedTokenizer and accept the standard Hugging Face tokenizer arguments.
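
In practice this means the usual Hugging Face calling conventions should carry over, as in the sketch below (the input strings are illustrative, and a fresh default SliceTokenizer instance is used):

# Batch-encode two illustrative strings with the standard Hugging Face tokenizer call.
tok = SliceTokenizer(model_max_length=512)
batch = tok(["Ga Ga N N 0 2 - - o", "Si O O 0 1 - o o"], truncation=True)
print(batch["input_ids"])  # one list of token IDs per input string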

Tokenizers with special number tokenization

The special_num_token argument (False by default) can be set to True to tokenize numbers using the special numerical tokenization scheme designed and implemented in the Regression Transformer.

tokenizer = SliceTokenizer(
                special_num_token=True,
                model_max_length=512, 
                special_tokens={}, 
                truncation=True,
                padding="max_length", 
                max_length=512
            )
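
To see what the flag changes, compare the tokenization of a numeric string with and without it; the sketch below is illustrative only (the exact numeric sub-tokens depend on the implementation):

# Compare default vs. special numeric tokenization on an illustrative string.
plain = SliceTokenizer(model_max_length=512)
numeric = SliceTokenizer(special_num_token=True, model_max_length=512)

text = "Ga Ga N N 0 2 - - o"
print(plain.tokenize(text))    # tokens without special number handling
print(numeric.tokenize(text))  # numbers tokenized with the special numeric scheme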

👐 Contributing

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.

👋 Attribution

⚖️ License

The code in this package is licensed under the MIT License. See the Notice for imported LGPL code.

💰 Funding

This project has been supported by the Carl Zeiss Foundation as well as Intel and Merck.

Download files

Download the file for your platform.

Source Distribution

  • xtal2txt-0.1.0.tar.gz (30.7 kB)

Built Distribution

  • xtal2txt-0.1.0-py3-none-any.whl (20.5 kB)

File details

Details for the file xtal2txt-0.1.0.tar.gz.

File metadata

  • Download URL: xtal2txt-0.1.0.tar.gz
  • Upload date:
  • Size: 30.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.14

File hashes

Hashes for xtal2txt-0.1.0.tar.gz

  • SHA256: 08c9d94235a096de5d1e83c66769188e1076da6df8838ca09889d4531ee816d9
  • MD5: bcac68814ebc7c5c705966c54ea03308
  • BLAKE2b-256: 36511ae508877fd0f8494dbf8cc3f0cfde9fe2b297d56c3b0626e036eb49d8f2


File details

Details for the file xtal2txt-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: xtal2txt-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 20.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.14

File hashes

Hashes for xtal2txt-0.1.0-py3-none-any.whl

  • SHA256: 5b3ccaf569c829d2db4c7cd080910b9d282c467c93e2b2a37f242fa55ca366bd
  • MD5: b0bf4887891e00782ec116e8d784a7b0
  • BLAKE2b-256: a148c32f9425b91ab3d6942d2efe9b11f99b47c8b106f161def1b4de03c3adcc

