
This package scrapes features of elements and compounds from specified websites


Webscraping Project

Predicting the Band Gap of Compounds using Elemental Descriptors

Abstract

It has been shown that there is a predictive relationship between 'known discrete scalar descriptors associated with crystal and electronic structure and observed properties of materials'. However, the property space of these materials is high-dimensional, which highlights the complexity of predictive models at the fundamental level. Additionally, the elemental descriptors at this level exhibit a degree of co-dependence, which complicates prediction further. It has been demonstrated that data reduction methods can shrink the property space of observable material properties. In this project, a dataset of elements and some of their corresponding elemental descriptors has been collected using webscraping techniques. The descriptors/features were limited to five, since it has been shown that band gap energies can be predicted using only five elemental descriptors.

Motivation

Recent advances in materials science and engineering have focused on producing rational design rules and principles for material fabrication. The development of these design rules has huge implications for fields such as crystal engineering, opto-electronics and photonics. In this regard, considerable attempts have been made to use already accumulated datasets to build models that predict various material properties with machine learning techniques. Despite recent advances in this field, there is still a dearth of machine-learning-based models for predicting band gap energies.

Methodology

Using the python libraries selenium and pandas, a database of elements with their elemental descriptors has been extracted. The code functions in a multifaceted way, as detailed below:

  • Open the desired website containing information on the elements or compounds.

  • Extract specific information on the attributes of the element or compound.

    • To do this, the python scripts OQMD_new_version.py and periodic_table_new.py are utilized. Within these scripts, a scraper class has been defined with the following attributes:
    • n, whose definition depends on the script being executed: in OQMD_new_version.py, n is an integer defining the number of pages to extract data from, while in periodic_table_new.py it defines the number of elements to extract data from. In this case, data from 60 elements were extracted.
    • root defines the target url from which data is extracted.
    • features initialises a dictionary whose keys define the data to be extracted for each element or compound. It takes either of these forms: features = {'Element_Name':[], 'Atomic_Number':[], 'Electronegativity':[], 'Boiling_Point':[]} or features = {'Name':[], 'Spacegroup':[], 'Volume':[], 'Band_gap':[]}

    Both scripts contain the function extract_data(), which carries out the extraction of element and compound data.

    • Specifically, extract_data(to_DF) takes a single boolean argument, to_DF, which determines whether or not the extracted data is converted into a DataFrame.
  • Convert the result into a dataframe for further processing.

    • Using pandas, the extracted data is stored as a data frame by the convert_to_DF() function, which depends on the following variables: data_name, file_out and to_csv.
  • Clean the data

    • For the extracted element data, the boiling point was scraped in both kelvin (K) and degrees Celsius (°C), with the °C value in parentheses. For consistency, we delete all values in parentheses, leaving only the values in K. To do this, we import Python's built-in re module and apply the following substitution to each raw boiling-point string:

      re.sub(r'\([^)]*\)', '', boiling_point)

  • Import as an SQL database

    • Implementing the python modules psycopg2 and SQLAlchemy, the output data was imported into SQL in tabular form.
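The parenthesis-stripping step described above can be sketched as follows; the sample boiling-point string is a hypothetical illustration, not a value taken from the dataset:

```python
import re

def strip_parentheses(value: str) -> str:
    """Remove any parenthesised text (e.g. the °C reading) and tidy whitespace."""
    return re.sub(r'\([^)]*\)', '', value).strip()

# Hypothetical raw value: kelvin reading followed by the Celsius reading in parentheses
print(strip_parentheses('77.36 K (-195.79 °C)'))  # → 77.36 K
```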

Specifically, the main python library used for the extraction of data was selenium; pandas was used to convert the raw data into the desired output (i.e. a csv file).
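As a self-contained sketch of the final SQL-import step, the snippet below writes a small DataFrame to a database table with pandas' to_sql. An in-memory SQLite database stands in for illustration only; the package itself targets PostgreSQL via psycopg2 and SQLAlchemy, and the table and column values here are hypothetical:

```python
import sqlite3
import pandas as pd

# Hypothetical cleaned output of the element scraper
df = pd.DataFrame({
    'Element_Name': ['Hydrogen', 'Helium'],
    'Atomic_Number': [1, 2],
    'Boiling_Point': ['20.271 K', '4.222 K'],
})

# In-memory SQLite connection used in place of the package's PostgreSQL setup
conn = sqlite3.connect(':memory:')
df.to_sql('elements', conn, index=False)

# Read the table back to confirm the import
print(pd.read_sql('SELECT Element_Name, Atomic_Number FROM elements', conn))
```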

Setting up venv

  • Using anaconda3, set up a virtual environment (venv) that meets the necessary code requirements. All necessary requirements can be found in the file requirements.txt.

    source activate [env name]

    pip install -r requirements.txt

Installing and Running

  • To install this package:

    pip install el-compX-scraper

Running the Project

  1. Within python import the necessary modules:

        import scraper
        from scraper.OQMD_new_version import CompoundScraper
        from scraper.periodic_table_new import PeriodicTableScraper 
    
  2. To instantiate a scraper object, we can use the CompoundScraper class. Hence,

    root = "http://oqmd.org/api/search#apisearchresult"
    features = {'Name':[], 'Spacegroup':[], 'Volume':[], 'Band_gap':[]}
    scraper = CompoundScraper(n=1, root=root, list=[], features=features)
    scraper.extract_data()
    
    

    Similarly for the PeriodicTableScraper,

    root = "https://pubchem.ncbi.nlm.nih.gov/periodic-table/#view=list"
    features = {'Element_Name':[], 'Atomic_Number':[], 'Electronegativity':[], 'Boiling_Point':[]}
    scraper = PeriodicTableScraper(n=5, root=root, list=[], features=features)
    scraper.extract_data(to_DF=True)
    
  3. We can also run the example script, which instantiates a specific scraper object depending on the url. To execute this,

    import scraper.example
