Implementation of a pydantic BaseModel class for CAS Registry Numbers® (CAS RN®).
CAS Registry Number Validator and Sorter
A Python utility class for validating and sorting Chemical Abstracts Service (CAS) Registry Numbers®. Pydantic is used for validation and is a required dependency.
Overview
The CAS class provides functionality to work with CAS Registry Numbers® (CAS RN®), which are unique numerical identifiers assigned to chemical substances. This tool helps validate the format and checksum of CAS RNs and enables sorting collections of CAS numbers.
Features
- Validates Chemical Abstracts Service Registry Number® (CAS RN®) format
- Verifies CAS RN checksum
- Makes CAS numbers sortable in ascending or descending order based on numeric comparison rather than string-based comparison
- CAS object instances are hashable, so they can be used in Python sets (e.g. to eliminate duplicates)
- Calling str() on a CAS object instance, or printing it, gives the plain formatted registry number
- CAS object instances can be JSON serialized with Pydantic's model_dump_json()
- Pandas extension dtype (CASDtype) for working with CAS numbers in DataFrames with full type safety and validation
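The format and checksum rules listed above can be illustrated without the package. The sketch below is a standalone illustration, not the cas-reg implementation: the function name `is_valid_cas` and the regular expression are assumptions made for this example. It applies the CAS check-digit rule, in which the digits before the check digit are weighted 1, 2, 3, ... from right to left and the weighted sum modulo 10 must equal the check digit.

```python
import re

# Assumed pattern: 2-7 digits, 2 digits, 1 check digit, separated by hyphens.
_CAS_PATTERN = re.compile(r"([0-9]{2,7})-([0-9]{2})-([0-9])")


def is_valid_cas(value: str) -> bool:
    """Return True if value has the CAS RN format and a matching check digit."""
    match = _CAS_PATTERN.fullmatch(value)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("7732-18-5"))  # True  (weighted sum 105, 105 % 10 == 5)
print(is_valid_cas("50-00-1"))    # False (weighted sum 20, 20 % 10 == 0)
```

For water, 7732-18-5, the weighted sum is 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105, so the check digit is 5.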
Usage
from cas_reg import CAS
# Create a CAS instance
cas_number = CAS(num="7732-18-5") # Water
print(cas_number)
>>> 7732-18-5
# Use pydantic's by_alias to serialize the model to a JSON string keyed by "CAS"
print(cas_number.model_dump_json(by_alias=True))
>>> {"CAS":"7732-18-5"}
# Invalid CAS numbers do not validate
cas_string = "50-00-1"  # wrong check digit; formaldehyde is 50-00-0
try:
    my_cas_no = CAS(num=cas_string)
except ValueError as e:
    print(f"CAS Number, {cas_string}, is invalid: {e}")
# Sort multiple CAS numbers
cas_str_list = ["7732-18-5", "67-56-1", "124-38-9"]
cas_list = [CAS(num=x) for x in cas_str_list]
cas_list.sort()
print([str(x) for x in cas_list])
>>> ['67-56-1', '124-38-9', '7732-18-5']
# Use python sets to eliminate duplicates
cas_str_list2 = ["50-00-0", "7732-18-5", "67-56-1", "124-38-9", "50-00-0"]
cas_list2 = [CAS(num=x) for x in cas_str_list2]
unique_cas_list = list(set(cas_list2))
unique_cas_list.sort()
print([str(x) for x in unique_cas_list])
>>> ['50-00-0', '67-56-1', '124-38-9', '7732-18-5']
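The difference between numeric and string ordering can be seen with plain strings. This sketch does not use the CAS class. The helper `cas_sort_key` is a hypothetical name, and comparing the three hyphen-separated segments as integers is one assumed way to obtain a numeric order; cas-reg may compare instances differently internally.

```python
def cas_sort_key(cas: str) -> tuple[int, ...]:
    """Split 'NNNN-NN-N' into integer segments so comparison is numeric."""
    return tuple(int(part) for part in cas.split("-"))


cas_strings = ["7732-18-5", "67-56-1", "124-38-9"]

# Plain string sorting compares character by character,
# so "124-38-9" sorts before "67-56-1" because "1" < "6".
print(sorted(cas_strings))
# ['124-38-9', '67-56-1', '7732-18-5']

# Sorting on integer segments gives the numeric order.
print(sorted(cas_strings, key=cas_sort_key))
# ['67-56-1', '124-38-9', '7732-18-5']
```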
Usage of pandas extension dtype
The cas-reg package optionally provides a custom pandas extension dtype (CASDtype) and array type (CASArray) that allow you to work with CAS Registry Numbers in pandas DataFrames with full type safety and validation.
Installation with pandas support
# With uv
uv add "cas-reg[pandas]"
# With pip
pip install "cas-reg[pandas]"
Creating DataFrames with CAS columns
import pandas as pd
from cas_reg import CAS
from cas_reg.pandas_ext import CASDtype, CASArray
# Create a DataFrame with CAS numbers directly
df = pd.DataFrame({
    "cas": CASArray(["50-00-0", "58-08-2", "7732-18-5"]),
    "name": ["Formaldehyde", "Caffeine", "Water"],
    "molecular_weight": [30.03, 194.19, 18.02]
})
print(df)
#          cas          name  molecular_weight
# 0    50-00-0  Formaldehyde             30.03
# 1    58-08-2      Caffeine            194.19
# 2  7732-18-5         Water             18.02
print(df["cas"].dtype)
# CAS
Converting existing DataFrames
If you have an existing DataFrame with CAS numbers stored as strings (object dtype), you can convert them to the CASDtype:
# Existing DataFrame with CAS numbers as strings
df = pd.DataFrame({
    "cas": ["50-00-0", "58-08-2", "7732-18-5"],
    "name": ["Formaldehyde", "Caffeine", "Water"]
})
print(df["cas"].dtype) # object
# Convert the column to CASDtype
df["cas"] = df["cas"].astype(CASDtype())
print(df["cas"].dtype) # CAS
# Each value is now a CAS object
print(type(df["cas"].iloc[0])) # <class 'cas_reg.CAS'>
Sorting DataFrames by CAS numbers
CAS numbers can be sorted naturally within a DataFrame:
df = pd.DataFrame({
    "cas": CASArray(["58-08-2", "50-00-0", "7732-18-5", "67-56-1"]),
    "name": ["Caffeine", "Formaldehyde", "Water", "Methanol"]
})
# Sort by CAS number (ascending)
df_sorted = df.sort_values(by="cas")
print(df_sorted)
#          cas          name
# 1    50-00-0  Formaldehyde
# 0    58-08-2      Caffeine
# 3    67-56-1      Methanol
# 2  7732-18-5         Water
# Sort by CAS number (descending)
df_desc = df.sort_values(by="cas", ascending=False)
Validation of CAS number structure
The CASDtype automatically validates CAS numbers when they are added to the array. Invalid CAS numbers will raise a ValueError:
# This will raise a ValueError due to an invalid checksum
try:
    df = pd.DataFrame({
        "cas": CASArray(["50-00-1"])  # Invalid checksum
    })
except ValueError as e:
    print(f"Validation error: {e}")
# Validation error: Invalid CAS checksum for '50-00-1'
# Handle missing values with None or pd.NA
df = pd.DataFrame({
    "cas": CASArray(["50-00-0", None, "58-08-2"]),
    "name": ["Formaldehyde", "Unknown", "Caffeine"]
})
# Check for missing values
print(df["cas"].isna())
# 0    False
# 1     True
# 2    False
# Name: cas, dtype: bool
Working with CAS columns
# Create a Series with CAS numbers
cas_series = pd.Series(CASArray(["50-00-0", "58-08-2", "7732-18-5"]))
# Get min and max CAS numbers
print(cas_series.min()) # 50-00-0
print(cas_series.max()) # 7732-18-5
# Filter DataFrames
df = pd.DataFrame({
    "cas": CASArray(["50-00-0", "58-08-2", "7732-18-5"]),
    "toxicity": ["high", "medium", "low"]
})
high_toxicity = df[df["toxicity"] == "high"]
print(high_toxicity["cas"].iloc[0]) # 50-00-0
# Concatenate DataFrames with CAS columns
df1 = pd.DataFrame({"cas": CASArray(["50-00-0", "58-08-2"])})
df2 = pd.DataFrame({"cas": CASArray(["7732-18-5"])})
df_combined = pd.concat([df1, df2], ignore_index=True)
print(df_combined["cas"].dtype) # CAS
Joining DataFrames on CAS numbers
You can perform inner joins (or other merge operations) on DataFrames using CAS numbers as the key:
# Create two DataFrames with CAS number columns
chemicals_df = pd.DataFrame({
    "cas": CASArray(["50-00-0", "58-08-2", "7732-18-5", "67-56-1"]),
    "name": ["Formaldehyde", "Caffeine", "Water", "Methanol"],
    "formula": ["CH2O", "C8H10N4O2", "H2O", "CH3OH"]
})
properties_df = pd.DataFrame({
    "cas": CASArray(["58-08-2", "7732-18-5", "64-17-5", "50-00-0"]),
    "boiling_point": [178, 100, 78, -19],
    "state": ["solid", "liquid", "liquid", "gas"]
})
# Perform inner join on CAS number
merged_df = pd.merge(chemicals_df, properties_df, on="cas", how="inner")
print(merged_df)
#          cas          name    formula  boiling_point   state
# 0    50-00-0  Formaldehyde       CH2O            -19     gas
# 1    58-08-2      Caffeine  C8H10N4O2            178   solid
# 2  7732-18-5         Water        H2O            100  liquid
# The CAS dtype is preserved after merge
print(merged_df["cas"].dtype) # CAS
# You can also do left, right, or outer joins
left_join = pd.merge(chemicals_df, properties_df, on="cas", how="left")
# This will include Methanol with NaN values for boiling_point and state
Converting to dictionary or JSON
df = pd.DataFrame({
    "cas": CASArray(["50-00-0", "58-08-2"]),
    "name": ["Formaldehyde", "Caffeine"]
})
# Convert to dictionary (CAS objects preserved)
records = df.to_dict(orient="records")
print(records[0]["cas"])  # 50-00-0 (still a CAS object; print() shows its str form)
# Convert CAS objects to strings for JSON serialization
df["cas_str"] = df["cas"].astype(str)
json_data = df.to_json(orient="records")
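An alternative to adding a string column is to let the standard library's json module fall back to str() for values it cannot serialize. The sketch below uses a hypothetical stand-in class, `FakeCAS`, in place of the real CAS class so that it runs without cas-reg installed. It relies only on the behaviour described in the features list, namely that str() of a CAS instance is the formatted registry number.

```python
import json


class FakeCAS:
    """Hypothetical stand-in for cas_reg.CAS; only str() behaviour is modelled."""

    def __init__(self, num: str) -> None:
        self.num = num

    def __str__(self) -> str:
        return self.num


# Shaped like the output of df.to_dict(orient="records").
records = [
    {"cas": FakeCAS("50-00-0"), "name": "Formaldehyde"},
    {"cas": FakeCAS("58-08-2"), "name": "Caffeine"},
]

# default=str is called for any object json cannot serialize natively.
json_data = json.dumps(records, default=str)
print(json_data)
# [{"cas": "50-00-0", "name": "Formaldehyde"}, {"cas": "58-08-2", "name": "Caffeine"}]
```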
Installation
With uv (recommended):
uv add cas-reg
With pip:
pip install cas-reg
Requirements
- Python 3.11+
- Pydantic 2.8.2+
Optional dependencies
For pandas support:
- pandas 2.0.0+
- numpy 1.24.0+
Install with: pip install "cas-reg[pandas]" or uv add "cas-reg[pandas]"
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.