docanalysis
Extract structured information from ethics paragraphs.
Unsupervised entity extraction from sections of papers that follow defined boilerplates. Examples of such sections include Ethics Statements, Funders, and Acknowledgments.
Purpose
Primary Purpose
- Extracting Ethics Committees and other entities related to Ethics Statements from papers
- Curating the extracted entities to public databases like Wikidata
- Building a feedback loop: unsupervised entity extraction feeds curation of the extracted information into public repositories, which in turn enables supervised entity extraction.
Subsidiary Purpose(s)
The use case can go beyond Ethics Statements. docanalysis is a general package that can extract relevant entities from the section of your interest. Sections like Acknowledgements, Data Availability Statements, etc., all have a fairly generic sentence structure. All you have to do is create an ami dictionary that contains boilerplates of the section of your interest. You can then use docanalysis to extract entities. Check the [dictionaries](https://github.com/petermr/docanalysis#what-is-a-dictionary) section, which outlines the steps for creating custom dictionaries. In the case of acknowledgements or funding, you might be interested in the players involved. Or you might have a use case we have never thought of!
Installation
- Git clone the repository:

  ```
  git clone https://github.com/petermr/docanalysis.git
  ```

- Run `setup.py` from inside the repository directory:

  ```
  python setup.py install
  ```
Tools Used and their purpose
- pygetpapers - scrapes repositories to download papers of interest
- ami - sections the papers
- nltk - splits sections into sentences
- spaCy - recognizes Named Entities and labels them
  - Here's the list of NER labels spaCy's English model provides: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
  - In most of our projects (Ethics Statements and Acknowledgements Mining), we are mainly interested in GPE (Geopolitical Entities) and ORG (Organization) labels
Documentation
```
extract_entities_from_papers(CORPUS_PATH, TERMS_XML_PATH, QUERY=None, HITS=None, make_project=False, install_ami=False, removefalse=True, create_csv=True, csv_name='entities.csv', labels_to_get=['GPE', 'ORG'])
```
Parameters:
- `CORPUS_PATH`: path to an existing corpus (CProject)
- `TERMS_XML_PATH`: path to an ami dictionary (some are in the ethics dictionary folder)
- `QUERY`: query sent to EPMC
- `HITS`: number of papers you wish to download
- `make_project`: defaults to False; set it to True to create a new CProject using pygetpapers
- `install_ami`: installs Java ami if set to True
- `removefalse`: removes sentences with zero matches against dictionary phrases and sentences with no Named Entities recognized
- `create_csv`: creates .csv output in `CORPUS_PATH`
- `csv_name`: default csv file name is `entities.csv`
- `labels_to_get`: spaCy recognizes Named Entities and labels them; choose the labels you are interested in by providing them as a list. For all available labels, check the Tools Used section.
How to run?
We have created demo.py, which shows how to run the package.
```python
import os
from docanalysis import DocAnalysis

ethic_statement_creator = DocAnalysis()

# Arguments follow the documented order: CORPUS_PATH, TERMS_XML_PATH, QUERY, HITS
dict_for_entities = ethic_statement_creator.extract_entities_from_papers(
    os.path.join(
        os.getcwd(), "stem_cell_research_300"
    ),
    os.path.join(
        os.getcwd(), "ethics_dictionary", "ethics_key_phrases", "ethics_key_phrases.xml"
    ),
    "essential oil AND chemical composition",
    100,
)

# Pull out entities with a particular label and write them to text files
list_with_orgs = ethic_statement_creator.extract_particular_fields(
    dict_for_entities, 'ORG')
with open('org.text', 'w') as f:
    f.write(str(list_with_orgs))

list_with_gpe = ethic_statement_creator.extract_particular_fields(
    dict_for_entities, 'GPE')
with open('GPE.text', 'w') as f:
    f.write(str(list_with_gpe))
```
To break this down:

Variable snippet | What is it?
---|---
`essential oil AND chemical composition` | Query to pygetpapers (EPMC default)
`100` | Number of hits
`stem_cell_research_300` | Output directory
`"ethics_dictionary", "ethics_key_phrases", "ethics_key_phrases.xml"` | Dictionary path
What is a dictionary
A dictionary, in ami's terminology, is a set of terms/phrases in XML format. Dictionaries related to ethics and acknowledgments are available in the Ethics Dictionary folder. If you'd like to create a custom dictionary, you can find the steps here.
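For illustration, a minimal ami dictionary might look like the following sketch; the `<entry>` terms here are hypothetical examples, not the contents of the actual ethics dictionaries:

```xml
<dictionary title="ethics_key_phrases">
  <!-- Each entry holds one boilerplate term or phrase to match -->
  <entry term="ethics committee"/>
  <entry term="institutional review board"/>
  <entry term="informed consent"/>
</dictionary>
```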
History
History is available in the dictionary repository.
Warning: the dictionary repository is messy!
Credits:
Daniel Mietchen, Peter Murray-Rust, Ayush Garg, Shweata N. Hegde
Research Idea