xtal2txt
Package to define, convert, encode and decode crystal structures into text representations
💪 Getting Started
🚀 Installation
The most recent code and data can be installed directly from GitHub with:
$ pip install git+https://github.com/lamalab-org/xtal2txt.git
Text Representation with xtal2txt
The TextRep class in xtal2txt.core facilitates the transformation of crystal structures into different text representations. Below is an example of its usage:
from xtal2txt.core import TextRep
from pymatgen.core import Structure
# Load a structure from a CIF file
from_file = "InCuS2_p1.cif"
structure = Structure.from_file(from_file, "cif")

# Initialize the TextRep class
text_rep = TextRep.from_input(structure)
requested_reps = [
    "cif_p1",
    "slices",
    "atom_sequences",
    "atom_sequences_plusplus",
    "crystal_text_llm",
    "zmatrix",
]
# Get the requested text representations
requested_text_reps = text_rep.get_requested_text_reps(requested_reps)
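A request-by-name API like get_requested_text_reps is typically a dispatch from representation names to converter methods. The sketch below illustrates that pattern only; the class, method bodies, and returned strings are hypothetical placeholders, not xtal2txt's actual implementation:

```python
class TextRepSketch:
    """Illustrative name-to-method dispatch, in the spirit of a
    get_requested_text_reps API. Method bodies are placeholders."""

    def cif_p1(self):
        return "data_InCuS2 ..."  # placeholder, not a real CIF string

    def slices(self):
        return "In Cu S S ..."  # placeholder, not a real SLICES string

    def get_requested_text_reps(self, requested):
        # Look up each requested representation by name and call it,
        # collecting the results into a name -> text mapping.
        return {name: getattr(self, name)() for name in requested}


reps = TextRepSketch().get_requested_text_reps(["cif_p1", "slices"])
print(sorted(reps))  # ['cif_p1', 'slices']
```

A dispatch table keyed by string names keeps the caller's request list decoupled from the individual converters, which is why a single list of representation names suffices above.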
Using xtal2txt Tokenizers
By default, the tokenizer is initialized with [CLS] and [SEP] tokens. For an example, see the SliceTokenizer usage:
from xtal2txt.tokenizer import SliceTokenizer
tokenizer = SliceTokenizer(
    model_max_length=512,
    truncation=True,
    padding="max_length",
    max_length=512,
)
print(tokenizer.cls_token) # returns [CLS]
You can access the [CLS] token using the cls_token attribute of the tokenizer. During decoding, you can use the skip_special_tokens parameter to skip these special tokens.
Decoding with skipping special tokens:
tokenizer.decode(token_ids, skip_special_tokens=True)
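Conceptually, skip_special_tokens filters marker tokens such as [CLS] and [SEP] out of the decoded output. A pure-Python sketch of that behavior, assuming a toy vocabulary (this is illustrative, not the tokenizer's internals):

```python
def decode(token_ids, id_to_token, skip_special_tokens=False,
           special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    # Map ids back to tokens, optionally dropping the special markers.
    tokens = [id_to_token[i] for i in token_ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in special_tokens]
    return " ".join(tokens)


vocab = {0: "[CLS]", 1: "Cu", 2: "S", 3: "[SEP]"}
print(decode([0, 1, 2, 3], vocab))                            # [CLS] Cu S [SEP]
print(decode([0, 1, 2, 3], vocab, skip_special_tokens=True))  # Cu S
```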
Initializing tokenizers with custom special tokens
In scenarios where the [CLS] token is not required, you can initialize the tokenizer with an empty special_tokens dictionary.
Initialization without [CLS] and [SEP] tokens:
tokenizer = SliceTokenizer(
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512,
)
All Xtal2txtTokenizer instances inherit from PreTrainedTokenizer and accept arguments compatible with Hugging Face tokenizers.
Tokenizers with special number tokenization
The special_num_token argument (False by default) can be set to True to tokenize numbers in the special way designed and implemented by RegressionTransformer.
tokenizer = SliceTokenizer(
    special_num_token=True,
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512,
)
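The RegressionTransformer scheme encodes each digit together with its decimal place, so a number's magnitude is explicit in the token stream. A self-contained sketch of that idea, assuming a `_digit_place_` token format in the spirit of the paper (the exact tokens xtal2txt emits may differ):

```python
def tokenize_number(num_str):
    # Split a decimal string into digit tokens annotated with their
    # decimal place: "12.5" -> digit 1 at place 1, 2 at place 0, 5 at -1.
    int_part, _, frac_part = num_str.partition(".")
    tokens = [f"_{d}_{len(int_part) - 1 - i}_" for i, d in enumerate(int_part)]
    tokens += [f"_{d}_{-(j + 1)}_" for j, d in enumerate(frac_part)]
    return tokens


print(tokenize_number("12.5"))  # ['_1_1_', '_2_0_', '_5_-1_']
```

Because each token carries its decimal place, a model sees that "12.5" and "125" differ in magnitude even though they share the same digits.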
👐 Contributing
Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.
👋 Attribution
⚖️ License
The code in this package is licensed under the MIT License. See the Notice for imported LGPL code.
💰 Funding
This project has been supported by the Carl Zeiss Foundation as well as Intel and Merck.