This package scrapes features of elements and compounds from specified websites
Webscraping Project
Predicting the Band Gap of Compounds using Elemental Descriptors
Abstract
It has been shown that there is a predictive relationship between 'known discrete scalar descriptors associated with crystal and electronic structure and observed properties of materials'. However, the property space of these materials is of high dimensionality, which highlights the complexity of predictive models at the fundamental level. Additionally, the elemental descriptors at this level exhibit a degree of co-dependence, which complicates prediction further. It has been demonstrated that data-reduction methods can shrink the property space of observable material properties. In this project, a dataset of elements and some of their corresponding elemental descriptors has been collected using webscraping techniques. The elemental descriptors/features were limited to five, since it has been shown that band gap energies can be predicted using only five elemental descriptors.
Motivation
Recent advances in materials science and engineering have focused on producing rational design rules and principles for material fabrication. The development of these design rules has huge implications for fields such as crystal engineering, opto-electronics and photonics. In this regard, considerable attempts have been made to use already-accumulated datasets to build models that predict various material properties with machine learning techniques. Despite recent advances in this field, there is a dearth of machine-learning-based models for predicting band gap energies.
Methodology
Using the Python libraries Selenium and pandas, a database of elements and their elemental descriptors has been extracted. The code was written to function in a multifaceted way, as detailed below:
- Open the desired website containing information on the elements or compounds.
- Extract specific information on the attributes of each element or compound.
  - To do this, the Python scripts `OQMD_new_version.py` and `periodic_table_new.py` are used. In each, a scraper class is defined with the following attributes:
    - `n`, whose meaning depends on the script being executed. For `OQMD_new_version.py`, `n` is an integer defining the number of pages to extract data from, while `n` in `periodic_table_new.py` defines the number of elements to extract data from. In this case, data from 60 elements were extracted.
    - `root` defines the target URL from which data is extracted.
    - `features` initialises a dictionary whose keys define the data to be extracted for each element or compound. It takes either of these forms: `features = {'Element_Name':[], 'Atomic_Number':[], 'Electronegativity':[], 'Boiling_Point':[]}` or `features = {'Name':[], 'Spacegroup':[], 'Volume':[], 'Band_gap':[]}`
  - These scripts contain the function `extract_data()`, which carries out the extraction of element and compound data. Specifically, `extract_data(to_DF)` depends on `to_DF`, which only accepts boolean values; `to_DF` determines whether or not the data will be converted into a DataFrame.
- Convert the result into a DataFrame for further processing.
  - Using pandas, the extracted data is stored as a DataFrame by the `convert_to_DF()` function, which depends on the variables `data_name`, `file_out` and `to_csv`.
- Clean the data.
  - For the extracted element data, the boiling point was scraped in both °C and Kelvin (K), with the value in °C given in parentheses. For consistency, all values in parentheses are deleted, leaving the values in K. To do this, we import the `re` module and apply: `re.sub(r'\([^)]*\)', '', '[filename]')`
- Import as an SQL database.
  - Using the Python modules `psycopg2` and `SQLAlchemy`, the output data was imported into SQL in tabular form.
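The cleaning and SQL-import steps above can be sketched end to end as follows (a minimal sketch: the sample rows are illustrative, and the standard-library `sqlite3` stands in for the `psycopg2`/`SQLAlchemy` setup used in the project):

```python
import re
import sqlite3

import pandas as pd

# Illustrative rows in the format described above: boiling point scraped
# in Kelvin with the Celsius value in parentheses.
features = {
    'Element_Name': ['Hydrogen', 'Helium'],
    'Atomic_Number': [1, 2],
    'Boiling_Point': ['20.271 K (-252.879 °C)', '4.222 K (-268.928 °C)'],
}
df = pd.DataFrame(features)

# Clean: delete every parenthesised value, keeping only the Kelvin figure.
df['Boiling_Point'] = df['Boiling_Point'].map(
    lambda s: re.sub(r'\([^)]*\)', '', s).strip()
)

# Import the cleaned table into an SQL database in tabular form.
conn = sqlite3.connect(':memory:')
df.to_sql('elements', conn, index=False)
print(pd.read_sql('SELECT * FROM elements', conn))
```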
Specifically, the main Python library used for the extraction of data was Selenium. pandas was used to convert the raw data into the desired output (i.e. a CSV file).
Setting up venv
- Using anaconda3, set up a virtual environment (venv) that satisfies the code requirements. All necessary requirements can be found in the file `requirements.txt`:

```
source activate [env name]
pip install -r requirements.txt
```
Installing and Running
- To install this package:

```
pip install el-compX-scraper
```
Running the Project
- Within Python, import the necessary modules:

```python
import scraper
from scraper.OQMD_new_version import CompoundScraper
from scraper.periodic_table_new import PeriodicTableScraper
```
- To instantiate a scraper object, we can use the CompoundScraper class. Hence,

```python
root = "http://oqmd.org/api/search#apisearchresult"
features = {'Name':[], 'Spacegroup':[], 'Volume':[], 'Band_gap':[]}
scraper = CompoundScraper(n=1, root=root, list=[], features=features)
scraper.extract_data()
```
Similarly for the PeriodicTableScraper,

```python
root = "https://pubchem.ncbi.nlm.nih.gov/periodic-table/#view=list"
features = {'Element_Name':[], 'Atomic_Number':[], 'Electronegativity':[], 'Boiling_Point':[]}
scraper = PeriodicTableScraper(n=5, root=root, list=[], features=features)
scraper.extract_data(to_DF=True)
```
- We can also run an example script which instantiates specific scraper objects depending on the URL. To execute this:

```python
import scraper.example
```
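For readers extending the package, the scraper classes instantiated above follow roughly this structure (a hypothetical sketch: the attribute names `n`, `root`, `list` and `features` come from the methodology section, but the internals shown here are illustrative; the real classes drive a Selenium browser inside `extract_data`):

```python
import pandas as pd

class SketchScraper:
    """Structural sketch of CompoundScraper / PeriodicTableScraper."""

    def __init__(self, n, root, list, features):
        self.n = n                # pages (OQMD) or elements (periodic table) to scrape
        self.root = root          # target URL to extract data from
        self.list = list          # accumulator used during scraping
        self.features = features  # dict of feature names -> lists of values

    def extract_data(self, to_DF=False):
        # The real implementation opens self.root with Selenium and appends
        # one value per feature for each element/compound scraped; here we
        # only show the to_DF branching described in the methodology.
        if to_DF:
            return pd.DataFrame(self.features)
        return self.features
```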
Hashes for el_compX_scraper-0.1.6-py3.9.egg

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9b853dec86317d6d2aa4b0ec3d40f2074e920c254f8b0d287f49186a39f5cc27 |
| MD5 | 6bebb890c7337754f7433f582baa8a26 |
| BLAKE2b-256 | 50904fbd82284ed1dc1d6149fc3700d66506ea5677bcd8482f7aef20b5f6e1bd |