Extract chemical data from Safety Data Sheet documents
SDSParser
SDSParser is a browser-based app for extracting chemical data from Safety Data Sheet documents. SDSParser will speed up your data-entry process by eliminating the need to read through Safety Data Sheets to get the data you care about.
For a live demo, click here: SDSParser
For testing purposes, here are some SDS files to download and use:
Motivation
SDSParser was built out of the need to quickly access chemical data from Safety Data Sheets for data-entry purposes. Each chemical manufacturer styles and structures its SDSs a little differently. SDSParser can be updated to read a new manufacturer's format by adding a set of regular expressions that match that format.
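As a rough illustration of that idea, the sketch below keeps one dictionary of regular expressions per manufacturer format and looks a field up by format name. The names FORMAT_REGEXES and extract_field, the acme_chemicals format, and the patterns themselves are hypothetical; SDSParser's actual configuration may be laid out differently.

```python
import re

# Hypothetical per-manufacturer patterns; not SDSParser's real configuration.
FORMAT_REGEXES = {
    "sigma_aldrich": {
        "flash_point": re.compile(
            r"Flash point\s*[:\-]?\s*(\d+(?:\.\d+)?)", re.IGNORECASE
        ),
    },
    # Supporting a new manufacturer means adding one more entry like this.
    "acme_chemicals": {
        "flash_point": re.compile(
            r"FLASH PT\.?\s*=\s*(\d+(?:\.\d+)?)", re.IGNORECASE
        ),
    },
}


def extract_field(text, sds_format, field):
    """Return the first captured group for `field`, or None if nothing matches."""
    pattern = FORMAT_REGEXES.get(sds_format, {}).get(field)
    if pattern is None:
        return None
    match = pattern.search(text)
    return match.group(1) if match else None


print(extract_field("Flash point: 170 °C", "sigma_aldrich", "flash_point"))
print(extract_field("FLASH PT. = 93.5 C", "acme_chemicals", "flash_point"))
```

Keeping the patterns in data rather than in code means a new format only touches the dictionary, not the matching logic.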
Tech/framework used
- pdfminer, a tool for extracting information from PDF documents
- pytesseract, a Python wrapper for Google's Tesseract-OCR
Features
Have some physical SDSs you need to scan and get data from? Have no fear: SDSParser will recognize your scanned file as an image and perform optical character recognition (OCR) to extract the text for you.
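One common way to implement this kind of fallback is to try the PDF's embedded text layer first and run OCR only when almost nothing comes back. The sketch below shows just that decision logic and is not SDSParser's actual code. The function extract_text_with_ocr_fallback, its parameters, and the min_chars threshold are assumptions made for illustration. In a real setup, read_text_layer might wrap pdfminer.six's pdfminer.high_level.extract_text and run_ocr might wrap pytesseract.image_to_string applied to rendered page images.

```python
def extract_text_with_ocr_fallback(path, read_text_layer, run_ocr, min_chars=20):
    """Return (text, used_ocr) for the PDF at `path`.

    `read_text_layer` and `run_ocr` are caller-supplied callables that take a
    file path and return a string. `min_chars` is an arbitrary threshold.
    """
    text = read_text_layer(path) or ""
    # A scanned PDF usually has no embedded text layer, so plain extraction
    # comes back empty or nearly empty; that is the cue to fall back to OCR.
    if len(text.strip()) >= min_chars:
        return text, False
    return run_ocr(path), True


# Stand-in extractors so the sketch runs without any PDF or OCR libraries.
def fake_text_layer(path):
    if path.endswith("scanned.pdf"):
        return ""
    return "Section 1: Identification. Product name: Acetone"


def fake_ocr(path):
    return "Section 1: Identification (recovered by OCR)"


print(extract_text_with_ocr_fallback("digital.pdf", fake_text_layer, fake_ocr))
print(extract_text_with_ocr_fallback("scanned.pdf", fake_text_layer, fake_ocr))
```

Passing the extractors in as callables keeps the decision logic testable without Tesseract installed.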
How to use?
Simply initialize SDSParser with an optional list of data fields you wish to extract (e.g. ['manufacturer', 'flash_point']) passed as request_keys. See configs.SDSRegexes.SDS_DATA_TITLES for the proper keys to use. If no keys are requested, all available data fields will be searched.
sds_parser = SDSParser(request_keys=['manufacturer', 'flash_point'])
Then call .get_sds_data(), passing in the path to your SDS document in .pdf format, to retrieve the matches.
chemical_data = sds_parser.get_sds_data(file_path)
chemical_data will be a dictionary object mapping request key names to their corresponding matches:
{'Manufacturer': 'Sigma-Aldrich',
'Product Name': 'Sodium dodecyl sulfate',
'Flash Point': '338',
'Specific Gravity': 'No data available',
'NFPA Fire': '3',
'NFPA Health': '2',
'NFPA Reactivity': '3',
'SARA 311/312': 'Data not listed',
'Revision Date': '06/13/2018',
'Physical State': 'Rods',
'CAS # (if pure)': '151-21-3',
'format': 'sigma_aldrich',
'filename': 'sigma_aldrich_23.pdf'}
If a requested field is not found in the SDS, .get_sds_data() sets its value to the string 'Data not listed'.
If the field is found but there is no data under it, .get_sds_data() sets its value to the string 'No data available'.
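Because both cases come back as ordinary strings, downstream code has to check for them explicitly. The sketch below is one possible way to separate real matches from the two placeholder values. The helper split_results and the constant names are hypothetical and not part of SDSParser, and the sample dictionary is an excerpt of the example output above.

```python
# The two placeholder strings described above.
DATA_NOT_LISTED = "Data not listed"      # field heading absent from the SDS
NO_DATA_AVAILABLE = "No data available"  # heading present, nothing under it


def split_results(chemical_data):
    """Split a result dict into (found, not_listed, empty)."""
    found, not_listed, empty = {}, [], []
    for key, value in chemical_data.items():
        if value == DATA_NOT_LISTED:
            not_listed.append(key)
        elif value == NO_DATA_AVAILABLE:
            empty.append(key)
        else:
            found[key] = value
    return found, not_listed, empty


chemical_data = {
    "Manufacturer": "Sigma-Aldrich",
    "Flash Point": "338",
    "Specific Gravity": "No data available",
    "SARA 311/312": "Data not listed",
}

found, not_listed, empty = split_results(chemical_data)
print(found)       # fields with usable values
print(not_listed)  # fields missing from the document
print(empty)       # fields present but blank
```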
License
MIT © Aris Stepe
Download files
File details
Details for the file SDSParser-0.1.2.tar.gz.
File metadata
- Download URL: SDSParser-0.1.2.tar.gz
- Upload date:
- Size: 10.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | bfc3add3b4c52ae261b67ec6ede1195f12791c5525133bf7421b906ce0cf5c1b
MD5 | 4b19b0cb9d02cdf92ff386eca1ac6b29
BLAKE2b-256 | 20d7e2d46a52683507e22abbd3df09fe8d06e2393d82a4d06a60453aea56eb4e
File details
Details for the file SDSParser-0.1.2-py3-none-any.whl.
File metadata
- Download URL: SDSParser-0.1.2-py3-none-any.whl
- Upload date:
- Size: 14.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | eac9070f6698cf717af607afbb764e9788372b272e725406a7ec43d990581634
MD5 | 8e9d608b583547bc0b9ddf3b1e9f2463
BLAKE2b-256 | 0f3f4c8052342f16748467e7f6c1bed085b964ec2a8572bb0fe90d2b076cdf57