# wos-statistics
Python library for collecting data from Web of Science and exporting citation-statistics summaries tailored to all kinds of requirements. The crawling part is implemented with aiohttp for better speed.
## Installation
Python 3.6+ is supported.
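The package is published on PyPI under the name `pywos`, so the usual pip install should work:
```bash
pip install pywos
```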
## Quick Start
```python
from pywos.crawler import WosQuery, construct_search
from pywos.analysis import Papers
import asyncio
# get data
qd = construct_search(AI="D-3202-2011", PY="2014-2018") # construct the query for papers
wq = WosQuery(querydict=qd) # create the crawler object based on the query
loop = asyncio.get_event_loop()
task = asyncio.ensure_future(wq.main(path="data.json")) # use main function of the object to download paper metadata and save them in the path
loop.run_until_complete(task) # here we go
# analyse data
p = Papers("data.json") # fetch data from the path just now
p.show(['Last, First', 'Last, F.'], ['flast@abcu.edu.cn'], ['2017', '2018']) # generate the summary on citations in the form of a pandas dataframe
```
## Usage
### query part
The workflow is based on a legitimate query on Web of Science, used to download the metadata of papers. You should provide a query dict to the crawler class. `value(select[n])` corresponds to the n-th query condition, e.g. `AU` for the author's name, `AI` for the author identifier, `PY` for the publication year range, etc. `value(input[n])` corresponds to the n-th query value, e.g. the name of the author or the year range 2012-2018. If there are multiple conditions, `value(bool_[m]_[n])` entries should also be added; their values are `AND`, `OR`, or `NOT`, indicating how the search conditions are combined. Besides, `fieldCount` should be set to the number of query conditions. A legitimate query looks like `{'fieldCount': 2, 'value(input1)': 'D-1234-5678', 'value(select1)': 'AI', 'value(input2)': '2014-2018', 'value(select2)': 'PY', 'value(bool_1_2)': 'AND'}`. A quick helper function is provided to construct such a query dict easily for AND-connected conditions.
```python
from pywos.crawler import construct_search
construct_search(AI="D-1234-5678", PY="2018-2018")
# return value below
{'fieldCount': 2,
'value(bool_1_2)': 'AND',
'value(input1)': 'D-1234-5678',
'value(input2)': '2018-2018',
'value(select1)': 'AI',
'value(select2)': 'PY'}
```
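Since `construct_search` only AND-connects conditions, a query combining conditions with `OR` or `NOT` has to be written by hand following the scheme above, for example:
```python
# hand-built query dict: match papers by author name OR by researcher identifier
qd = {
    'fieldCount': 2,
    'value(select1)': 'AU', 'value(input1)': 'Last, First',
    'value(select2)': 'AI', 'value(input2)': 'D-1234-5678',
    'value(bool_1_2)': 'OR',
}
```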
### download part
First, initialize the crawler object by providing the query dict and, optionally, a dict of headers used for all HTTP connections (a default User-Agent is supplied).
```python
from pywos.crawler import WosQuery
# qd is a query dict, e.g. the one returned by construct_search above
wq = WosQuery(querydict=qd, headers={'User-Agent': 'blah-blah'})
```
The data-collecting task is invoked via `WosQuery.main(path=...)`. All parameters are optional except `path`, the pathname where the output data are saved. `citedcheck` is a bool; if set to true, all papers citing each query result are also collected, which is the basis for detailed citation analysis, such as citations by year and citations by others. The default value of `citedcheck` is false, in which case only the total citation count of each query paper is available. The `limit` option gives the maximum number of connections in the HTTP connection pool; the default is 20. A larger number implies faster downloads but also a higher risk of connection failure due to restrictions on the Web of Science side. `limit=30` has been tested without connection failures, and at that speed about 1000 papers can be handled in roughly one minute. If the query task is large, the better practice is to turn on `savebyeach=True`, so that every paper within the query is saved immediately after it is downloaded; after a connection failure, the task can then be resumed without fetching all the data again. Resumption is controlled by the `masklist` parameter of the main function: for every integer in this list, the corresponding paper is skipped to avoid repeating work. In sum, for a large task we use the following parameters.
```python
import asyncio
task = asyncio.ensure_future(wq.main(path="prefix", citedcheck=True, savebyeach=True, limit=30))
```
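If such a run fails partway, the files saved via `savebyeach=True` tell you which papers are already done; a resumption sketch (the indices below are hypothetical) could then look like:
```python
# resume a failed run: suppose papers 0..49 were already saved by savebyeach=True,
# so mask them out to avoid downloading them again
task = asyncio.ensure_future(
    wq.main(path="prefix", citedcheck=True, savebyeach=True, limit=30,
            masklist=list(range(50))))
```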
Actually running the task is a matter of asyncio; see below.
```python
loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(task)
except KeyboardInterrupt:
    # on interrupt, cancel all pending tasks and let the loop process the cancellation
    asyncio.gather(*asyncio.Task.all_tasks()).cancel()
    loop.stop()
    loop.run_forever()
finally:
    loop.close()
```
To watch the progress of the download, switch on the logging module.
```python
import logging
logger = logging.getLogger('pywos')
logger.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
logger.addHandler(ch)
```
### analysis part
The `Papers()` class is designed for analysis of the paper metadata. To initialize the object, provide the path of the metadata saved by `WosQuery.main(path)`. One can also provide a list of paths, so that the data of all these JSON files are imported. Besides, one can turn on `merge=True`, so that all files with the prefix `path-` are imported automatically; this is especially suitable for data files saved with `WosQuery.main(path, savebyeach=True)`.
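For example, either form works (the file names here are placeholders):
```python
from pywos.analysis import Papers
p1 = Papers(["data1.json", "data2.json"])  # import several saved metadata files
p2 = Papers("prefix", merge=True)          # import every file named prefix-*
```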
Generate the citation-analysis table by running `Papers.show(namelist, maillist, years)`. The name and mail lists are used to check whether one is the first/corresponding author of a paper, and citations within `years` are counted as recent citations. One can turn on `citedcheck=True` if the data to be analysed were obtained from `WosQuery.main(citedcheck=True)`; this enables further classification of citations by year (recent citations) and by author (citations by others vs. self-citations). The return value of `Papers.show()` is a `pandas.DataFrame`, which can easily be transformed into other formats, including csv, html, database tables and so on.
In sum,
```python
from pywos.analysis import Papers
p = Papers("path-prefix", merge=True)
p.show(["Last, First"], ["mail@server"], ["2018"], citedcheck=True)
```
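Since the result is a plain `pandas.DataFrame`, exporting takes only one more call, e.g.:
```python
df = p.show(["Last, First"], ["mail@server"], ["2018"], citedcheck=True)
df.to_csv("citation-summary.csv")  # or df.to_html(...), df.to_sql(...), etc.
```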