extract structured information from ethics paragraphs

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English
Programming Language

Project description

docanalysis

Unsupervised entity extraction from sections of papers that follow a defined boilerplate. Examples of such sections include Ethics Statements, Funders, and Acknowledgments.

Purpose

Primary Purpose

- Extracting Ethics Committees and other entities related to Ethics Statements from papers
- Curating the extracted entities into public databases like Wikidata
- Building a feedback loop that goes from unsupervised entity extraction, to curating the extracted information in public repositories, to supervised entity extraction

Subsidiary Purpose(s)

The use case can go beyond Ethics Statements. docanalysis is a general package that can extract relevant entities from the section of your interest.

Sections like Acknowledgements, Data Availability Statements, etc., all have a fairly generic sentence structure. All you have to do is create an ami dictionary that contains boilerplate phrases from the section of your interest. You can then use docanalysis to extract entities. The [dictionaries](https://github.com/petermr/docanalysis#what-is-a-dictionary) section outlines the steps for creating custom dictionaries. In the case of acknowledgements or funding, you might be interested in the players involved. Or you might have a use case we have never thought of!

Installation

Clone the repository
```
git clone https://github.com/petermr/docanalysis.git
```

Run setup.py from inside the repository directory
```
cd docanalysis
python setup.py install
```

Tools Used and their purpose

- pygetpapers: scrape repositories to download papers of interest
- ami: section the papers
- nltk: split text into sentences
- spaCy: recognize named entities and label them
  - Here's the list of NER labels spaCy's English model provides:
    CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
  - In most of our projects (Ethics Statements and Acknowledgements Mining), we are mainly interested in GPE (Geopolitical Entities) and ORG (Organizations)

Documentation

extract_entities_from_papers(CORPUS_PATH, TERMS_XML_PATH, QUERY=None, HITS=None, make_project=False, install_ami=False, removefalse=True, create_csv=True, csv_name='entities.csv', labels_to_get=['GPE', 'ORG'])

Parameters:
- CORPUS_PATH: path to an existing corpus (CProject)
- TERMS_XML_PATH: path to an ami dictionary (some are in the ethics dictionary folder)
- QUERY: query sent to EPMC
- HITS: number of papers you wish to download
- make_project: defaults to False; set it to True to create a new CProject using pygetpapers
- install_ami: installs Java ami if set to True
- removefalse: removes sentences with zero matches with dictionary phrases and sentences with no named entities recognized
- create_csv: creates .csv output in CORPUS_PATH
- csv_name: name of the .csv file; defaults to `entities.csv`
- labels_to_get: spaCy recognizes named entities and labels them. Provide the labels you are interested in as a list. For all available labels, see the Tools Used section.
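
The effect of `labels_to_get` and `removefalse` can be pictured with a small stand-alone sketch. It does not call docanalysis or spaCy, and it is not the package's implementation: the `Sentence` record, its field names, the `filter_sentences` helper, and the sample data are all hypothetical, chosen only to mirror the parameter descriptions above.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """Hypothetical record: one sentence plus what was found in it."""
    text: str
    dictionary_hits: list = field(default_factory=list)  # matched dictionary phrases
    entities: list = field(default_factory=list)         # (entity text, NER label) pairs


def filter_sentences(sentences, labels_to_get=("GPE", "ORG"), removefalse=True):
    """Keep only entities with the requested labels; optionally drop empty sentences."""
    kept = []
    for s in sentences:
        # Assumed reading of removefalse: skip sentences that matched no
        # dictionary phrase or in which no named entity was recognized.
        if removefalse and (not s.dictionary_hits or not s.entities):
            continue
        wanted = [(text, label) for text, label in s.entities if label in labels_to_get]
        kept.append(Sentence(s.text, s.dictionary_hits, wanted))
    return kept


# Made-up input: the first sentence has a dictionary hit and two entities,
# the second has neither.
sample = [
    Sentence(
        "The study was approved by the Ethics Committee of Example University in 2019.",
        ["approved by"],
        [("Example University", "ORG"), ("2019", "DATE")],
    ),
    Sentence("Samples were stored at low temperature.", [], []),
]
result = filter_sentences(sample)
```

With the defaults, only the first sentence survives and its DATE entity is discarded, which is the kind of narrowing the two parameters describe.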

How to run?

demo.py shows how to run the package:

import os
from docanalysis import DocAnalysis
ethic_statement_creator = DocAnalysis()
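dict_for_entities = ethic_statement_creator.extract_entities_from_papers(
    # Keyword arguments keep each value matched to the documented parameter
    CORPUS_PATH=os.path.join(
        os.getcwd(), "stem_cell_research_300",
    ),
    TERMS_XML_PATH=os.path.join(
        os.getcwd(), "ethics_dictionary", "ethics_key_phrases", "ethics_key_phrases.xml"
    ),
    QUERY="essential oil AND chemical composition",
    HITS=100,
    make_project=True,  # per the parameter list, needed to create a new CProject
)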
list_with_orgs = ethic_statement_creator.extract_particular_fields(
    dict_for_entities, 'ORG')
with open('org.text', 'w') as f:
    f.write(str(list_with_orgs))
list_with_gpe = ethic_statement_creator.extract_particular_fields(
    dict_for_entities, 'GPE')
with open('GPE.text', 'w') as f:
    f.write(str(list_with_gpe))

To break this down:

Variable snippet	What is it?
`essential oil AND chemical composition`	Query to `pygetpapers` (EPMC default)
`100`	Number of hits
`stem_cell_research_300`	Output directory
`"ethics_dictionary", "ethics_key_phrases", "ethics_key_phrases.xml"`	Dictionary path

What is a dictionary

A dictionary, in ami's terminology, is a set of terms/phrases in XML format. Dictionaries related to ethics and acknowledgments are available in the Ethics Dictionary folder.

If you'd like to create a custom dictionary, you can find the steps here.
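
As a rough illustration of "a set of terms/phrases in XML format", the sketch below builds a small dictionary with Python's standard library and reads its terms back. The element and attribute names (`dictionary`, `entry`, `term`, `name`, `title`) are assumptions about ami's layout, the helper functions are hypothetical, and the two phrases are made up; compare against a file in the Ethics Dictionary folder before relying on this structure.

```python
import xml.etree.ElementTree as ET


def build_dictionary(title, phrases):
    """Build an in-memory XML tree with one <entry> per phrase (assumed layout)."""
    root = ET.Element("dictionary", {"title": title})
    for phrase in phrases:
        ET.SubElement(root, "entry", {"term": phrase, "name": phrase})
    return root


def read_terms(root):
    """Return the 'term' attribute of every <entry> element, in document order."""
    return [entry.get("term") for entry in root.iter("entry")]


# Made-up boilerplate phrases for an ethics section.
phrases = ["ethics committee", "institutional review board"]
root = build_dictionary("ethics_key_phrases_demo", phrases)

# Serialize to text, then parse it again to show the round trip.
xml_text = ET.tostring(root, encoding="unicode")
terms = read_terms(ET.fromstring(xml_text))
```

Writing `xml_text` to a file would give a candidate value for `TERMS_XML_PATH`, provided the layout matches what ami expects.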

History

History is available in the dictionary repository.

Warning: The dictionary repository is messy!

Credits:

Daniel Mietchen, Peter Murray-Rust, Ayush Garg, Shweata N. Hegde

Research Idea

Release history

Version	Release date
0.3.0	Nov 4, 2023
0.2.8	Nov 3, 2023
0.2.7	Nov 3, 2023
0.2.6	Nov 3, 2023
0.2.5	Oct 20, 2023
0.2.4	Aug 22, 2023
0.2.3	Aug 22, 2023
0.2.2	Aug 22, 2023
0.2.1	Aug 22, 2023
0.2.0	Sep 16, 2022
0.1.9	Jul 9, 2022
0.1.8	Jul 8, 2022
0.1.7	Jul 8, 2022
0.1.6	Jul 7, 2022
0.1.5	Jul 7, 2022
0.1.4	Jul 7, 2022
0.1.3	Jul 6, 2022
0.1.2	Jul 6, 2022
0.1.1	Jun 12, 2022
0.1.0	May 17, 2022
0.0.9	Apr 4, 2022
0.0.8	Mar 13, 2022
0.0.7	Feb 9, 2022
0.0.6	Feb 8, 2022
0.0.5 (this version)	Feb 8, 2022

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

docanalysis-0.0.5-py3-none-any.whl (27.9 kB)

Uploaded Feb 8, 2022 Python 3

Hashes for docanalysis-0.0.5-py3-none-any.whl

Algorithm	Hash digest
SHA256	`ee873f26b79169f75ce4b2251074a34874b6989d8176e18f7480ad24d0c7e5cb`
MD5	`d864768c6aeffd94dd85839d508263f5`
BLAKE2b-256	`992327925d6bd4d5ec4e9246b243c25c1f351e93cc159cd8e8c80394192ae6b3`