
A light-weight Python API for classifying scientific documents with the topics from the Computer Science Ontology (https://cso.kmi.open.ac.uk/home).


CSO-Classifier

Abstract

Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this repository, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles yielding a significant improvement over alternative methods.

About

The CSO Classifier is a novel application that takes as input the text from abstract, title, and keywords of a research paper and outputs a list of relevant concepts from CSO. It consists of two main components: (i) the syntactic module and (ii) the semantic module. Figure 1 depicts its architecture. The syntactic module parses the input documents and identifies CSO concepts that are explicitly referred in the document. The semantic module uses part-of-speech tagging to identify promising terms and then exploits word embeddings to infer semantically related topics. Finally, the CSO Classifier combines the results of these two modules and enhances them by including relevant super-areas.

Figure 1: Framework of CSO Classifier

Requirements

  1. Ensure you have Python 3 installed.
  2. Install the necessary dependencies by executing the following command: pip install -r requirements.txt
  3. Download the English language package for spaCy by executing: python -m spacy download en_core_web_sm

Releases

Here we list the available releases of the CSO Classifier. These releases are available for download both from GitHub and Zenodo.

v2.1

This new release (version v2.1) makes the CSO Classifier more scalable. Compared to the previous version (v2.0), the classifier relies on a cached word2vec model that maps the words in the model vocabulary directly to CSO topics. Thanks to this cache, the classifier can quickly retrieve all CSO topics that can be inferred from a given set of tokens, speeding up processing. In addition, the cache is much lighter (~64MB) than the full word2vec model (~366MB), which further reduces loading time.

Thanks to this improvement, the CSO Classifier is around 24x faster and can easily be run on large corpora of scholarly data.

Download from:

DOI

v2.0

The second version (v2.0) implements the CSO Classifier as described in the About section. It combines the results of the syntactic and semantic modules and enriches them with their super-topics. Compared to v1.0, it adds a semantic layer that produces a more comprehensive result, identifying research topics that are not explicitly mentioned in the metadata. The semantic module relies on a word2vec model trained on over 4.5M papers in Computer Science; below we describe in more detail how this model was trained. In this version of the classifier, we pickled the model to speed up loading it into memory (~4.5 times faster).

Salatino, A.A., Osborne, F., Thanapalasingam, T. and Motta, E. 2018. The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles. Available in Pre-Print here

Download from:

DOI

v1.0

The first version (v1.0) of the CSO Classifier is an implementation of the syntactic module, which was also previously used to support the semi-automatic annotation of proceedings at Springer Nature [1]. This classifier syntactically matches n-grams (unigrams, bigrams, and trigrams) of the input document against concepts in CSO.
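
To illustrate the idea (this is not the actual implementation in syntacticmodule.py), the sketch below extracts unigrams, bigrams, and trigrams from a text and keeps those that exactly match a label in a toy set of CSO topics:

import re

def extract_ngrams(text, max_n=3):
    """Yield all unigrams, bigrams and trigrams of the lower-cased text."""
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def syntactic_match(text, cso_topics):
    """Keep only the n-grams that exactly match a CSO topic label."""
    return {ngram for ngram in extract_ngrams(text) if ngram in cso_topics}

# Toy topic set for illustration; the real classifier loads the full ontology (cso.p).
topics = {"social networks", "data mining", "privacy"}
print(syntactic_match("Privacy in online social networks and data mining", topics))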

More details about this version of the classifier can be found within:

Salatino, A.A., Thanapalasingam, T., Mannocci, A., Osborne, F. and Motta, E. 2018. Classifying Research Papers with the Computer Science Ontology. ISWC-P&D-Industry-BlueSky 2018 (2018). Read more

Download from:

DOI

List of Files

  • CSO-Classifier.ipynb: :page_facing_up: Python notebook for executing the classifier
  • requirements.txt: :page_facing_up: File containing the necessary libraries to run the classifier
  • images: :file_folder: folder containing some pictures, e.g., the workflow shown above
  • classifier: :file_folder: Folder containing the main functionalities of the classifier
    • classifier.py: :page_facing_up: contains the function for running the CSO Classifier
    • syntacticmodule.py: :page_facing_up: functionalities that implement the syntactic module
    • semanticmodule.py: :page_facing_up: functionalities that implement the semantic module
    • misc.py: :page_facing_up: some miscellaneous functionalities
    • models: :file_folder: Folder containing the word2vec model and CSO
      • cso.csv: :page_facing_up: file containing the Computer Science Ontology in csv
      • cso.p: :page_facing_up: serialised file containing the Computer Science Ontology (pickled)
      • token-to-cso-combined.json: :page_facing_up: file containing the cached word2vec model. This JSON file contains a dictionary in which each token of the corpus vocabulary has been mapped to the corresponding CSO topics. Below we explain how this file has been generated.

Word2vec model and token-to-cso-combined file generation

In this section, we describe how we generated the word2vec model used within the CSO Classifier and how the token-to-cso-combined file was created.

Word Embedding generation

We applied the word2vec approach [2,3] to a collection of text from the Microsoft Academic Graph (MAG) to generate word embeddings. MAG is a scientific knowledge base and a heterogeneous graph containing scientific publication records, citation relationships, authors, institutions, journals, conferences, and fields of study. It is the largest publicly available dataset of scholarly data and, as of December 2018, it contains more than 210 million publications.

We first downloaded the titles and abstracts of 4,654,062 English papers in the field of Computer Science. Then we pre-processed the data by replacing spaces with underscores in all n-grams matching CSO topic labels (e.g., “digital libraries” became “digital_libraries”) and in frequent bigrams and trigrams (e.g., “highest_accuracies”, “highly_cited_journals”). These frequent n-grams were identified by analysing combinations of words that co-occur together, as suggested in [2], using the parameters shown in Table 1. Indeed, while it is possible to obtain the vector of an n-gram by averaging the embedding vectors of all its words, the resulting representation is usually not as good as the one obtained by treating the n-gram as a single word during training.

Finally, we trained the word2vec model using the parameters provided in Table 2. The parameters were set to these values after testing several combinations.

| min-count | threshold |
|-----------|-----------|
| 5         | 10        |

Table 1: Parameters used during the collocation words analysis

| method   | emb. size | window size | min count cutoff |
|----------|-----------|-------------|------------------|
| skipgram | 128       | 10          | 10               |

Table 2: Parameters used for training the word2vec model.
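
For illustration, the pipeline could be reproduced with gensim roughly as follows; the toy corpus and output file name are placeholders for the actual MAG data, and the gensim 4 API is assumed:

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

# Toy corpus standing in for the 4.6M MAG titles/abstracts; in the real pipeline,
# n-grams matching CSO labels are already joined by underscores (e.g. "digital_libraries").
corpus = [
    ["we", "study", "digital_libraries", "and", "information", "retrieval"],
    ["highly", "cited", "journals", "in", "information", "retrieval"],
] * 1000

# Collocation analysis (Table 1): bigrams that co-occur frequently enough
# are joined with an underscore.
phrases = Phrases(corpus, min_count=5, threshold=10)
corpus = [phrases[sentence] for sentence in corpus]

# Word2vec training (Table 2): skip-gram, 128-dimensional vectors, window of 10,
# discarding tokens that occur fewer than 10 times (gensim 4 uses vector_size;
# older releases call this parameter size).
model = Word2Vec(corpus, vector_size=128, window=10, min_count=10, sg=1)
model.wv.save_word2vec_format("cso_word2vec.bin", binary=True)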

After training the model, we obtained a gensim.models.keyedvectors.Word2VecKeyedVectors object weighing 366MB. You can download the model from here.

The size of the model hindered the performance of the classifier in two ways. Firstly, it took several seconds to load into memory. This was partially addressed by serialising the model file (using Python pickle, see version v2.0 of the CSO Classifier, ~4.5 times faster). Secondly, while processing a document, the classifier needs to retrieve the top 10 similar words for all tokens and compare them with CSO topics. This operation takes several seconds per document, making the model a bottleneck for the classification process.
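
A minimal sketch of this serialisation step, assuming the model is stored in the standard word2vec binary format (file names are illustrative):

import pickle
from gensim.models import KeyedVectors

# Load the trained vectors once (illustrative file name), then pickle the object
# so later runs can deserialise it directly instead of re-parsing the binary format.
model = KeyedVectors.load_word2vec_format("cso_word2vec.bin", binary=True)
with open("cso_word2vec.p", "wb") as f:
    pickle.dump(model, f)

# At classification time, loading the pickled object is considerably faster.
with open("cso_word2vec.p", "rb") as f:
    model = pickle.load(f)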

To this end, we decided to create a cached model (token-to-cso-combined.json), a dictionary that directly connects every token in the model vocabulary with the CSO topics. This strategy allows the classifier to quickly retrieve all CSO topics that can be inferred from a particular token.
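
At classification time, the cache can then be queried directly. A minimal sketch, assuming the JSON file maps each vocabulary token to a list of CSO topics (the actual records may store additional information):

import json

# Load the cached mapping once; path as in the repository layout listed above.
with open("classifier/models/token-to-cso-combined.json") as f:
    token_to_cso = json.load(f)

def topics_for(tokens):
    """Collect all CSO topics that the cache associates with the given tokens."""
    inferred = set()
    for token in tokens:
        inferred.update(token_to_cso.get(token, []))
    return inferred

print(topics_for(["privacy", "microblogging_service"]))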

token-to-cso-combined file

To generate this file, we collected the set of all words available in the vocabulary of the model. Then, iterating over each word, we retrieved its top 10 similar words from the model and computed their Levenshtein similarity against all CSO topics. If the similarity was above 0.7, we created a record storing all CSO topics triggered by the initial word.
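
The sketch below reproduces this procedure under a few assumptions: the word2vec model is the pickled gensim KeyedVectors created above (gensim 4 API, hypothetical file name), the topic labels sit in the first column of cso.csv, and the Levenshtein similarity comes from the python-Levenshtein package.

import csv
import json
import pickle
from Levenshtein import ratio  # python-Levenshtein, assumed here for the similarity step

with open("cso_word2vec.p", "rb") as f:            # pickled word2vec model (hypothetical file name)
    model = pickle.load(f)
with open("classifier/models/cso.csv") as f:
    cso_topics = {row[0] for row in csv.reader(f)}  # assumes topic labels in the first column

token_to_cso = {}
for word in model.index_to_key:                     # every token in the model vocabulary
    similar = [w for w, _ in model.most_similar(word, topn=10)]
    triggered = {topic for candidate in similar for topic in cso_topics
                 if ratio(candidate.replace("_", " "), topic) > 0.7}
    if triggered:
        token_to_cso[word] = sorted(triggered)

with open("token-to-cso-combined.json", "w") as f:
    json.dump(token_to_cso, f)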

Usage examples

In this section, we explain how to run the CSO Classifier to classify a single paper or multiple papers (batch mode).

Classifying a single paper (SP)

Sample Input (SP)

The sample input is a dictionary containing title, abstract and keywords as keys:

paper = {
        "title": "De-anonymizing Social Networks",
        "abstract": "Operators of online social networks are increasingly sharing potentially "
            "sensitive information about users and their relationships with advertisers, application "
            "developers, and data-mining researchers. Privacy is typically protected by anonymization, "
            "i.e., removing names, addresses, etc. We present a framework for analyzing privacy and "
            "anonymity in social networks and develop a new re-identification algorithm targeting "
            "anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, "
            "we show that a third of the users who can be verified to have accounts on both Twitter, a "
            "popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified "
            "in the anonymous Twitter graph with only a 12% error rate. Our de-anonymization algorithm is "
            "based purely on the network topology, does not require creation of a large number of dummy "
            "\"sybil\" nodes, is robust to noise and all existing defenses, and works even when the overlap "
            "between the target network and the adversary's auxiliary information is small.",
        "keywords": "data mining, data privacy, graph theory, social networking (online)"
        }

Run (SP)

Just import the classifier and run it:

import classifier.classifier as CSO
result = CSO.run_cso_classifier(paper, modules = "both", enhancement = "first")
print(result)

To observe the available settings please refer to the Parameters section.

Sample Output (SP)

As output, the classifier returns a dictionary with four components: (i) syntactic, (ii) semantic, (iii) union, and (iv) enhanced. Below you can find an example. The keys syntactic and semantic contain the topics returned by the syntactic and semantic modules, respectively. Union contains the unique topics found by the previous two modules. Under enhanced you can find the relevant super-areas.

{
    "syntactic": [
        "sensitive informations",
        "graph theory",
        "real-world networks",
        "network topology",
        "social networks",
        "anonymity",
        "anonymization",
        "twitter",
        "microblogging",
        "privacy",
        "data privacy",
        "online social networks",
        "data mining"
    ],
    "semantic": [
        "social networks",
        "online social networks",
        "data mining",
        "privacy",
        "data privacy",
        "anonymization",
        "anonymity",
        "twitter",
        "microblogging",
        "topology",
        "network topology",
        "graph theory",
        "network architecture",
        "network structures",
        "social networking sites",
        "association rules",
        "micro-blog"
    ],
    "union": [
        "sensitive informations",
        "social networking sites",
        "micro-blog",
        "network architecture",
        "graph theory",
        "social networks",
        "network topology",
        "real-world networks",
        "topology",
        "anonymity",
        "anonymization",
        "association rules",
        "twitter",
        "microblogging",
        "network structures",
        "privacy",
        "data privacy",
        "online social networks",
        "data mining"
    ],
    "enhanced": [
        "complex networks",
        "privacy preserving",
        "world wide web",
        "theoretical computer science",
        "social media",
        "network protocols",
        "access control",
        "security of data",
        "online systems",
        "electric network topology",
        "computer science",
        "facebook",
        "network security",
        "neural networks",
        "authentication"
    ]
}

Classifying in batch mode (BM)

Sample Input (BM)

The sample input is a dictionary of dictionaries. Each key is a paper id (e.g., id1, see below) and its value is itself a dictionary containing title, abstract and keywords.

papers = {
    "id1": {
        "title": "De-anonymizing Social Networks",
        "abstract": "Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc. We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate. Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy \"sybil\" nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary's auxiliary information is small.",
        "keywords": "data mining, data privacy, graph theory, social networking (online)"
    },
    "id2": {
        "title": "Title of sample paper id2",
        "abstract": "Abstract of sample paper id2",
        "keywords": "keyword1, keyword2, ..., keywordN"
    }
}

Run (BM)

Import the python script and run the classifier:

import classifier.classifier as CSO
result = CSO.run_cso_classifier_batch_mode(papers, modules = "both", enhancement = "first")
print(result)

To observe the available settings please refer to the Parameters section.

Sample Output (BM)

As output, the classifier returns a dictionary of dictionaries. For each classified paper (identified by its id), it returns a dictionary containing four components: (i) syntactic, (ii) semantic, (iii) union, and (iv) enhanced. Below you can find an example. The keys syntactic and semantic contain the topics returned by the syntactic and semantic modules, respectively. Union contains the unique topics found by the previous two modules. Under enhanced you can find the relevant super-areas.

{
    "id1": {
        "syntactic": [
            "sensitive informations","graph theory", "real-world networks", "network topology", "social networks", "anonymity", "anonymization", "twitter", "microblogging", "privacy", "data privacy", "online social networks", "data mining"
        ],
        "semantic": [
            "social networks", "online social networks", "data mining", "privacy", "data privacy", "anonymization", "anonymity", "twitter", "microblogging", "topology", "network topology", "graph theory", "network architecture", "network structures", "social networking sites", "association rules", "micro-blog"
        ],
        "union": [
            "sensitive informations", "social networking sites", "micro-blog", "network architecture", "graph theory", "social networks", "network topology", "real-world networks", "topology", "anonymity", "anonymization", "association rules", "twitter", "microblogging", "network structures", "privacy", "data privacy", "online social networks", "data mining"
        ],
        "enhanced": [
            "complex networks", "privacy preserving", "world wide web", "theoretical computer science", "social media", "network protocols", "access control", "security of data", "online systems", "electric network topology", "computer science", "facebook", "network security", "neural networks", "authentication"
        ]
    },
    "id2": {
        "syntactic": [...],
        "semantic": [...],
        "union": [...],
        "enhanced": [...]
    }
}

Parameters

Besides the paper(s), the function running the CSO Classifier accepts two additional parameters: (i) modules and (ii) enhancement. Both parameters are strings that define the behaviour of the classifier.

(1) The parameter modules can be either "syntactic", "semantic", or "both". With "syntactic", the classifier runs only the syntactic module; with "semantic", it runs only the semantic module; with "both", it runs both modules and combines their results. The default value for modules is "both".

(2) The parameter enhancement can be either "first", "all", or "no". This parameter controls whether the classifier will infer, for a given topic (e.g., Linked Data), only its direct super-topics (e.g., Semantic Web) or all of its super-topics (e.g., Semantic Web, WWW, Computer Science). With "first", it infers only the direct super-topics; with "all", it infers all super-topics; with "no", it performs no enhancement. The default value for enhancement is "first".
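
For example, using the paper dictionary from the single-paper example above:

import classifier.classifier as CSO

# Syntactic module only, without super-topic enhancement.
result_syntactic = CSO.run_cso_classifier(paper, modules = "syntactic", enhancement = "no")

# Both modules, enriched with all super-topics up to the root of CSO.
result_full = CSO.run_cso_classifier(paper, modules = "both", enhancement = "all")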

License

Apache 2.0

References

[1] Osborne, F., Salatino, A., Birukou, A. and Motta, E. 2016. Automatic Classification of Springer Nature Proceedings with Smart Topic Miner. The Semantic Web -- ISWC 2016. 9982 LNCS, (2016), 383–399. DOI:https://doi.org/10.1007/978-3-319-46547-0_33

[2] Mikolov, T., Chen, K., Corrado, G. and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. (Jan. 2013).

[3] Mikolov, T., Chen, K., Corrado, G. and Dean, J. 2013. Distributed Representations of Words and Phrases and their Compositionality. Advances in neural information processing systems. 3111–3119.

