This package scrapes features of elements and compounds from specified websites.

Project description

Webscraping Project

Predicting the Band Gap of Compounds using Elemental Descriptors

Abstract

It has been shown that there is a predictive relationship between 'known discrete scalar descriptors associated with crystal and electronic structure and observed properties of materials'. However, the property space of these materials is of high dimensionality, which highlights the complex nature of predictive models at the fundamental level. Additionally, the elemental descriptors at this level have a certain degree of co-dependence, which makes prediction even more complicated. It has been demonstrated that data reduction methods can diminish the property space of observable material properties. In this project, a dataset of elements and some of their corresponding elemental descriptors has been collected using webscraping techniques. The elemental descriptors/features were limited to five, since it has been shown that band gap energies can be predicted using only five elemental descriptors.

Motivation

Recent advances in material science and engineering have been focused on producing rational design rules and principles for material fabrication. The development of these design rules has huge implications for various fields such as crystal engineering, opto-electronics and photonics. In this regard, considerable attempts have been made to utilize already accumulated datasets to create models that facilitate the prediction of various material properties using machine learning techniques. Despite recent advances in this field, there is a dearth of machine-learning-based models to predict band gap energies.

Methodology

Using the Python libraries Selenium and pandas, a database of elements with elemental descriptors was extracted. The code was written to function in a multifaceted way, as detailed below:

  • Open the desired website containing information on the elements or compound.

  • Extract specific information on the attributes of the element or compound.

    • To do this, the Python scripts OQMD_new_version.py and periodic_table_new.py are utilized. Within these scripts, a scraper class has been defined with the following attributes:
    • n, whose definition depends on the script being executed. For OQMD_new_version.py, n is an integer that defines the number of pages to extract data from, while n in periodic_table_new.py defines the number of elements to extract data from. In this case, data from 60 elements were extracted.
    • root defines the target URL from which data is extracted.
    • features initialises a dictionary whose keys define the data to be extracted for each element or compound. It takes either of these forms: features = {'Element_Name':[], 'Atomic_Number':[], 'Electronegativity':[], 'Boiling_Point':[]} or features = {'Name':[], 'Spacegroup':[], 'Volume':[], 'Band_gap':[]}

    These scripts contain the function extract_data(), which carries out the extraction of element and compound data.

    • Specifically, extract_data(to_DF) takes a single boolean argument, to_DF, which determines whether or not the data will be converted into a DataFrame.
  • Convert the result into a dataframe for further processing.

    • Using pandas, the extracted data is stored as a DataFrame by the convert_to_DF() function, which depends on the following variables: data_name, file_out and to_csv.
  • Clean the data

    • For the extracted element data, the boiling point of each element was scraped in both Kelvin (K) and °C, with the °C value in parentheses. For consistency, all values in parentheses are deleted, leaving the values in K. To do this, the re module is imported and the following substitution is applied to each raw boiling-point string (here called boiling_point):

      re.sub(r'\([^)]*\)', '', boiling_point)

  • Import into an SQL database

    • Using the Python modules psycopg2 and SQLAlchemy, the output data was imported into SQL in tabular form.
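As a minimal, self-contained sketch of this import step: the project itself targets SQL via psycopg2 and SQLAlchemy, but the same tabular import can be illustrated with the standard-library sqlite3 module. The table name, column names and sample rows below are hypothetical, chosen to mirror the element features described above.

```python
import sqlite3

# Hypothetical cleaned rows mirroring the element features scraped
# by periodic_table_new.py (name, atomic number, electronegativity,
# boiling point).
rows = [
    ("Hydrogen", 1, 2.2, "20.28 K"),
    ("Helium", 2, None, "4.22 K"),
]

# The package uses psycopg2/SQLAlchemy; sqlite3 (stdlib) is used here
# only to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE elements (
           element_name TEXT,
           atomic_number INTEGER,
           electronegativity REAL,
           boiling_point TEXT
       )"""
)
conn.executemany("INSERT INTO elements VALUES (?, ?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM elements").fetchone()[0]
print(count)  # prints 2
```

With a real PostgreSQL target, the connection object would come from psycopg2 or an SQLAlchemy engine instead, but the insert pattern is the same.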

Specifically, the main Python library used for the extraction of data was Selenium. pandas was used to convert the raw data into the desired output (i.e. a CSV file).
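The cleaning and conversion steps described above can be sketched with only the standard library. The raw boiling-point strings and column names below are assumptions based on the description, and the csv module stands in for the package's pandas-based convert_to_DF().

```python
import csv
import io
import re

# Raw boiling points as scraped: Kelvin value followed by the Celsius
# value in parentheses (format assumed from the description above).
features = {
    "Element_Name": ["Hydrogen", "Oxygen"],
    "Boiling_Point": ["20.28 K (-252.87 °C)", "90.19 K (-182.96 °C)"],
}

# Strip the parenthesised Celsius reading, keeping only the Kelvin value.
features["Boiling_Point"] = [
    re.sub(r"\([^)]*\)", "", bp).strip() for bp in features["Boiling_Point"]
]

# Write the features dict out as a CSV table: one header row of keys,
# then one row per element.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(features.keys())
writer.writerows(zip(*features.values()))
print(buf.getvalue())
```

After cleaning, each boiling point is reduced to its Kelvin form (e.g. "20.28 K"), matching the consistency goal stated in the methodology.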

Setting up venv

  • Using anaconda3, set up a virtual environment (venv) that meets the necessary code requirements. All necessary requirements can be found in the file requirements.txt.

    source activate [env name]

    pip install -r requirements.txt

Installing and Running

  • To install this package:

    pip install el-compX-scraper

Running the Project

  1. Within python import the necessary modules:

        import scraper
        from scraper.OQMD_new_version import CompoundScraper
        from scraper.periodic_table_new import PeriodicTableScraper 
    
  2. To instantiate a scraper object, we can use the CompoundScraper class. Hence,

    root = "http://oqmd.org/api/search#apisearchresult"
    features = {'Name':[], 'Spacegroup':[], 'Volume':[], 'Band_gap':[]}
    scraper = CompoundScraper(n=1, root=root, list=[], features=features)
    scraper.extract_data()
    
    

    Similarly for the PeriodicTableScraper,

    root = "https://pubchem.ncbi.nlm.nih.gov/periodic-table/#view=list"
    features = {'Element_Name':[], 'Atomic_Number':[], 'Electronegativity':[], 'Boiling_Point':[]}
    scraper = PeriodicTableScraper(n=5, root=root, list=[], features=features)
    scraper.extract_data(to_DF=True)
    
  3. We can also run the example script, which instantiates specific scraper objects depending on the url. To execute this,

    import scraper.example


Source distribution: el-compX-scraper-0.1.7.tar.gz (8.5 kB)
