Project description
xtal2txt
Package to define, convert, encode and decode crystal structures into text representations.
xtal2txt is an important part of our MatText framework.
💪 Getting Started
🚀 Installation
The most recent release can be installed from PyPI with:
$ pip install xtal2txt
The most recent code and data can be installed directly from GitHub with:
$ pip install git+https://github.com/lamalab-org/xtal2txt.git
Text Representation with xtal2txt
The TextRep class in xtal2txt.core facilitates the transformation of crystal structures into different text representations. Below is an example of its usage:
from xtal2txt.core import TextRep
from pymatgen.core import Structure
# Load structure from a CIF file
from_file = "InCuS2_p1.cif"
structure = Structure.from_file(from_file)
# Initialize TextRep Class
text_rep = TextRep.from_input(structure)
requested_reps = [
"cif_p1",
"slices",
"atom_sequences",
"atom_sequences_plusplus",
"crystal_text_llm",
"zmatrix"
]
# Get the requested text representations
requested_text_reps = text_rep.get_requested_text_reps(requested_reps)
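The call above returns the requested representations. Assuming the returned object behaves like a dictionary keyed by representation name (an assumption for this sketch), you can inspect them like this:
# Print a short preview of each requested representation
for name, rep in requested_text_reps.items():
    print(name, str(rep)[:80])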
Using xtal2txt Tokenizers
By default, the tokenizer is initialized with [CLS] and [SEP] tokens. For an example, see the SliceTokenizer usage:
from xtal2txt.tokenizer import SliceTokenizer
tokenizer = SliceTokenizer(
    model_max_length=512,
    truncation=True,
    padding="max_length",
    max_length=512
)
print(tokenizer.cls_token) # returns [CLS]
You can access the [CLS] token using the cls_token attribute of the tokenizer. During decoding, you can use the skip_special_tokens parameter to skip these special tokens.
Decoding with skipping special tokens:
tokenizer.decode(token_ids, skip_special_tokens=True)
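Putting these pieces together, a minimal encode/decode round trip could look like the sketch below; the SLICES string reuses the requested_text_reps dictionary from the TextRep example above (an assumption; any SLICES string works as input):
# Encode a SLICES string into token ids, then decode it while skipping [CLS] and [SEP]
slices_string = requested_text_reps["slices"]
token_ids = tokenizer.encode(slices_string)
print(tokenizer.decode(token_ids, skip_special_tokens=True))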
Initializing tokenizers with custom special tokens
In scenarios where the [CLS] token is not required, you can initialize the tokenizer with an empty special_tokens dictionary. Initialization without [CLS] and [SEP] tokens:
tokenizer = SliceTokenizer(
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512
)
All Xtal2txtTokenizer instances inherit from PreTrainedTokenizer and accept arguments compatible with the Hugging Face tokenizer interface.
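For instance, the familiar Hugging Face call signature applies. The sketch below batch-encodes two illustrative placeholder strings (not real SLICES output) and assumes PyTorch is installed for return_tensors="pt":
# Batch-encode two strings with padding and truncation, as with any Hugging Face tokenizer
batch = tokenizer(
    ["Ga N 0 1 + + +", "In Cu S S 0 2 - - o"],
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)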
Tokenizers with special number tokenization
The special_num_token argument (False by default) can be set to True to tokenize numbers in a special way, as designed and implemented by RegressionTransformer.
tokenizer = SliceTokenizer(
    special_num_token=True,
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512
)
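As a quick check, you can tokenize a string containing numbers and inspect how they are split; the input below is an illustrative placeholder, and the exact tokens produced depend on the tokenizer version:
# With special_num_token=True, numbers are decomposed into digit-level tokens
# instead of being kept as whole tokens
print(tokenizer.tokenize("Ga 2 O 3"))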
👐 Contributing
Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.
👋 Attribution
⚖️ License
The code in this package is licensed under the MIT License. See the Notice for imported LGPL code.
💰 Funding
This project has been supported by the Carl Zeiss Foundation as well as Intel and Merck.
File details
Details for the file xtal2txt-0.1.0.tar.gz.
File metadata
- Download URL: xtal2txt-0.1.0.tar.gz
- Upload date:
- Size: 30.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 08c9d94235a096de5d1e83c66769188e1076da6df8838ca09889d4531ee816d9
MD5 | bcac68814ebc7c5c705966c54ea03308
BLAKE2b-256 | 36511ae508877fd0f8494dbf8cc3f0cfde9fe2b297d56c3b0626e036eb49d8f2
File details
Details for the file xtal2txt-0.1.0-py3-none-any.whl.
File metadata
- Download URL: xtal2txt-0.1.0-py3-none-any.whl
- Upload date:
- Size: 20.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5b3ccaf569c829d2db4c7cd080910b9d282c467c93e2b2a37f242fa55ca366bd
MD5 | b0bf4887891e00782ec116e8d784a7b0
BLAKE2b-256 | a148c32f9425b91ab3d6942d2efe9b11f99b47c8b106f161def1b4de03c3adcc