Extract chemical data from Safety Data Sheet documents
![alt text](https://github.com/astepe/sds_parser/blob/master/LogoSample_ByTailorBrands.jpg)
## SDSParser
SDSParser is a browser-based app for extracting chemical data from Safety Data Sheet documents. SDSParser will speed up your
data-entry process by eliminating the need to read through Safety Data Sheets to get the data you care about.
For a live demo, click here: [SDSParser](http://www.arisstepe.com/projects/submit_sds)
For testing purposes, here are some SDS files to download and use:
* [Fisher Scientific](https://www.fishersci.com/store/msds?partNumber=M2131&productDescription=MET+ISOBUTYL+KETONE+CR+ACS+1L&vendorId=VN00033897&countryCode=US&language=en)
* [Sigma Aldrich](https://www.sigmaaldrich.com/MSDS/MSDS/DisplayMSDSPage.do?country=US&language=en&productNumber=P5958&brand=SIGALD&PageToGoToURL=https%3A%2F%2Fwww.sigmaaldrich.com%2Fcatalog%2Fsearch%3Fterm%3Dpotassium%26interface%3DAll%26N%3D0%26mode%3Dmatch%2520partialmax%26lang%3Den%26region%3DUS%26focus%3Dproduct)
## Motivation
SDSParser was built out of the need to quickly access chemical data from Safety Data Sheets for data-entry purposes. Each chemical manufacturer styles and structures its SDSs a little differently. SDSParser can be updated to read a new manufacturer's format by adding a set of regular expressions that match that format.
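To illustrate the per-manufacturer regex idea, here is a minimal sketch. The dictionary layout, the `example_manufacturer` name, the `extract_fields` helper, and the patterns themselves are all hypothetical and are not SDSParser's actual configuration; they only show how one set of patterns per manufacturer could be organized.

```python
import re

# Hypothetical layout: one dict of compiled patterns per manufacturer.
# These patterns are illustrative only and are not SDSParser's real regexes.
MANUFACTURER_REGEXES = {
    "example_manufacturer": {
        "flash_point": re.compile(
            r"Flash\s+point\s*:?\s*(?P<data>[^\n]+)", re.IGNORECASE
        ),
        "cas_number": re.compile(
            r"CAS[-\s]*No\.?\s*:?\s*(?P<data>\d{2,7}-\d{2}-\d)", re.IGNORECASE
        ),
    },
}


def extract_fields(text, manufacturer):
    """Return {field: matched text or None} using one manufacturer's patterns."""
    patterns = MANUFACTURER_REGEXES[manufacturer]
    results = {}
    for field, pattern in patterns.items():
        match = pattern.search(text)
        results[field] = match.group("data").strip() if match else None
    return results


# Made-up SDS text for demonstration.
sample_text = "Flash point: 14 C\nCAS-No.: 108-10-1\n"
print(extract_fields(sample_text, "example_manufacturer"))
```

Under this layout, supporting another manufacturer would mean adding one more entry to the dictionary rather than changing the extraction loop.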
## Tech/framework used
* [pdfminer](https://github.com/euske/pdfminer), a tool for extracting information from PDF documents
* [pytesseract](https://pypi.org/project/pytesseract/), a Python wrapper for Google's Tesseract-OCR
## Features
Have some physical SDSs you need to scan and get data from? Have no fear: SDSParser will recognize your scanned file as an image and perform optical character recognition (OCR) to extract the text for you.
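The fallback from text extraction to OCR can be sketched roughly as follows. This is not SDSParser's internal logic: the `extract_text_with_fallback` function, the `min_chars` threshold, and the stand-in extractors are assumptions made for illustration. In practice the two callables would wrap pdfminer and pytesseract.

```python
def extract_text_with_fallback(path, extract_pdf_text, extract_ocr_text, min_chars=20):
    """Try direct PDF text extraction first; fall back to OCR if little text comes back."""
    text = extract_pdf_text(path)
    # A scanned PDF typically has no text layer, so extraction yields almost nothing.
    # The min_chars threshold is an assumed heuristic, not a value from SDSParser.
    if len(text.strip()) >= min_chars:
        return text, "pdf"
    return extract_ocr_text(path), "ocr"


# Stand-in extractors for demonstration only.
def fake_pdf_text(path):
    return ""  # behaves like a scanned PDF with no text layer


def fake_ocr_text(path):
    return "Flash point: 14 C"


text, method = extract_text_with_fallback("scan.pdf", fake_pdf_text, fake_ocr_text)
print(method, "->", text)
```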
## How to install
`pip install SDSParser`
## How to use
Initialize `SDSParser`, optionally passing a list of the data fields you wish to extract (e.g. `['manufacturer', 'flash_point']`) as the `request_keys` keyword argument. See `configs.SDSRegexes.REQUEST_KEYS` for the valid keys. If no keys are requested, all available data fields will be searched.
```
>>> from sdsparser import SDSParser
>>> request_keys = ['manufacturer', 'flash_point', 'specific_gravity', 'product_name', 'sara_311', 'nfpa_fire']
>>> parser = SDSParser(request_keys=request_keys)
```
Here is the list of available keys:
```
>>> from sdsparser.configs.SDSRegexes import REQUEST_KEYS
>>> REQUEST_KEYS
[
'manufacturer',
'product_name',
'flash_point',
'specific_gravity',
'nfpa_fire',
'nfpa_health',
'nfpa_reactivity',
'sara_311',
'revision_date',
'physical_state',
'cas_number',
]
```
Call `parser.get_sds_data` with the path to your SDS document to get the SDS data.
```
>>> sds_data = parser.get_sds_data('path/to/SafetyDataSheet.pdf')
```
`.get_sds_data` returns a dictionary mapping each request key to its corresponding match:
```
>>> sds_data
{
'manufacturer': 'Sigma-Aldrich',
'product_name': 'Sodium dodecyl sulfate',
'flash_point': '338 F',
'specific_gravity': '3.2',
'sara_311': 'Data not listed',
'nfpa_fire': 'No data available'
}
```
If the heading for a requested field is not found in the SDS, the value for that key is the string `'Data not listed'`.
If the heading is found but no data is found under it, the value is the string `'No data available'`.
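Because both outcomes come back as ordinary strings, downstream code usually needs to tell them apart from real matches. A minimal sketch, assuming the dictionary shape shown above; the `split_results` helper and the constant names are hypothetical and not part of SDSParser:

```python
# Sentinel strings as documented above.
NOT_LISTED = "Data not listed"
NO_DATA = "No data available"


def split_results(sds_data):
    """Separate real matches from the two 'nothing found' cases."""
    found = {k: v for k, v in sds_data.items() if v not in (NOT_LISTED, NO_DATA)}
    missing_heading = [k for k, v in sds_data.items() if v == NOT_LISTED]
    empty_heading = [k for k, v in sds_data.items() if v == NO_DATA]
    return found, missing_heading, empty_heading


# Same values as the example output above.
sds_data = {
    "manufacturer": "Sigma-Aldrich",
    "product_name": "Sodium dodecyl sulfate",
    "flash_point": "338 F",
    "specific_gravity": "3.2",
    "sara_311": "Data not listed",
    "nfpa_fire": "No data available",
}

found, missing_heading, empty_heading = split_results(sds_data)
print(found)
print(missing_heading)
print(empty_heading)
```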
## SDSParser-cli
In your terminal, from a directory containing SDS files, pass a flag for each field you want to extract:
```
path/to/sds/directory $ sdsparser parse --flash_point --specific_gravity
{'fisher_1.pdf': {'flash_point': 'No data available',
'specific_gravity': 'No data available'},
'fisher_2.pdf': {'flash_point': 'No data available',
'specific_gravity': 'No data available'},
'fisher_3.pdf': {'flash_point': 'No data available',
'specific_gravity': '1.84'},
'fisher_5.pdf': {'flash_point': 'No data available',
'specific_gravity': 'No data available'}}
```
Or write the results to a CSV file:
```
path/to/sds/directory $ sdsparser parse --csv
path/to/sds/directory $ cat sds_data.csv
Fisher,Data not listed,No data available,No data available,1,0,0,/312 Hazard CategoriesSee section 2 for more informationCWA (Clean Water Act)Not,26-Jan-2018,Powder,Data not listed
Fisher,"Salicylic acid, sodium salt",No data available,(etc...)
```
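The CSV output can be loaded back into dictionaries with the standard `csv` module. This sketch assumes that the file has no header row and that its columns follow the order of `REQUEST_KEYS`, which is what the sample output above appears to show but is not confirmed here. The `read_sds_csv` helper and the sample row are made up for illustration.

```python
import csv
import io

# Assumed column order: the same as REQUEST_KEYS listed earlier.
COLUMNS = [
    "manufacturer",
    "product_name",
    "flash_point",
    "specific_gravity",
    "nfpa_fire",
    "nfpa_health",
    "nfpa_reactivity",
    "sara_311",
    "revision_date",
    "physical_state",
    "cas_number",
]


def read_sds_csv(csv_file):
    """Turn each CSV row into a {key: value} dict using the assumed column order."""
    reader = csv.reader(csv_file)
    return [dict(zip(COLUMNS, row)) for row in reader if row]


# Made-up row for demonstration; with a real file, use
# open("sds_data.csv", newline="") instead of io.StringIO.
sample = io.StringIO(
    'ExampleCo,"Salicylic acid, sodium salt",No data available,'
    "No data available,1,0,0,Data not listed,26-Jan-2018,Powder,Data not listed\n"
)
rows = read_sds_csv(sample)
print(rows[0]["product_name"])
```

The `csv` module handles the quoted product name, so the embedded comma does not split the field.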
For more information:
```
$ sdsparser --help
```
or
```
$ sdsparser parse --help
```
## License
MIT © [Aris Stepe](http://www.arisstepe.com/)