
# wos-statistics

A Python library for collecting data from Web of Science and exporting citation-statistics summaries for a variety of reporting requirements. The crawling part is implemented with aiohttp for speed.

## Installation

Python 3.6+ is supported. The package is published on PyPI as `pywos`, so it can be installed with `pip install pywos`.

## Quick Start

```python
from pywos.crawler import WosQuery, construct_search
from pywos.analysis import Papers
import asyncio

# get data
qd = construct_search(AI="D-3202-2011", PY="2014-2018")  # construct the query for papers
wq = WosQuery(querydict=qd)  # create the crawler object based on the query
loop = asyncio.get_event_loop()
task = asyncio.ensure_future(wq.main(path="data.json"))  # download paper metadata and save it to the given path
loop.run_until_complete(task)  # run the crawl

# analyse data
p = Papers("data.json")  # load the data saved above
p.show(['Last, First', 'Last, F.'], ['flast@abcu.edu.cn'], ['2017', '2018'])  # citation summary as a pandas DataFrame
```

## Usage

### query part

Everything is based on a legitimate query to Web of Science, which is used to download the metadata of the matching papers. You provide a query dict to the crawler class. `value(select[n])` corresponds to the nth query condition, e.g. `AU` for the author's name, `AI` for the author identifier, `PY` for the publication-year range, etc. `value(input[n])` corresponds to the nth query value, e.g. the author's name or the year range 2012-2018. If there are multiple conditions, keys of the form `value(bool_[m]_[n])` must also be added; their values are `AND`, `OR`, or `NOT`, indicating how the search conditions are combined. In addition, `fieldCount` must be set to the number of query conditions. A legitimate query looks like `{'fieldCount': 2, 'value(input1)': 'D-1234-5678', 'value(select1)': 'AI', 'value(input2)': '2014-2018', 'value(select2)': 'PY', 'value(bool_1_2)': 'AND'}`. A helper function is provided to construct such a query dict easily for AND-connected queries.

```python
from pywos.crawler import construct_search
construct_search(AI="D-1234-5678", PY="2018-2018")
# return value below
{'fieldCount': 2,
 'value(bool_1_2)': 'AND',
 'value(input1)': 'D-1234-5678',
 'value(input2)': '2018-2018',
 'value(select1)': 'AI',
 'value(select2)': 'PY'}
```
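
`construct_search` only AND-connects conditions. For other boolean operators, a query dict following the layout above can be written by hand; a minimal sketch (the author names here are placeholders):

```python
# Hand-built query dict: papers matching either spelling of an author's
# name, combined with OR instead of AND, using the key layout above.
query = {
    'fieldCount': 2,
    'value(select1)': 'AU', 'value(input1)': 'Last, First',
    'value(select2)': 'AU', 'value(input2)': 'Last, F.',
    'value(bool_1_2)': 'OR',
}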

### download part

First, initialize the crawler object by providing the query dict and, optionally, a dict of headers used for all HTTP connections (a default User-Agent is supplied otherwise).

```python
from pywos.crawler import WosQuery
wq = WosQuery(querydict={'value(input1)': '', ...},       # query dict as constructed above
              headers={'User-Agent': 'blah-blah'})        # optional custom headers
```

The data-collecting task is invoked via `WosQuery.main(path=...)`. All parameters are optional except `path`, the path under which the output data is saved. `citedcheck` is a bool: if set to `True`, all papers citing each query paper are also collected, which is the basis for detailed citation analysis such as citations by year and citations by others. The default is `False`, in which case only the total citation count of each query paper is available. `limit` sets the maximum number of connections in the HTTP connection pool; the default is 20. A larger number means faster downloads but also a higher risk of connection failure due to restrictions by Web of Science. `limit=30` has been tested without connection failures, and that speed handles roughly 1000 papers in about a minute. If the query task is large, the better practice is to set `savebyeach=True`, so that every paper in the query is saved immediately after it is downloaded. Then, after a connection failure, the task can be resumed without fetching all the data again; this is controlled by the `masklist` parameter of `main`. If `masklist` is provided, the paper corresponding to each integer in the list is skipped, avoiding repeated work. In sum, for a large task we have the following parameters.

```python
import asyncio
task = asyncio.ensure_future(wq.main(path="prefix", citedcheck=True, savebyeach=True, limit=30))
```
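
After an interruption, the already-saved papers can be skipped on the next run. A sketch, assuming `masklist` takes the integer indices of the papers already downloaded (here the first 500) and `wq` is the crawler object from above:

```python
# Resume a crawl interrupted after the first 500 papers were saved
# (savebyeach=True wrote each paper to disk as it finished).
task = asyncio.ensure_future(
    wq.main(path="prefix", citedcheck=True, savebyeach=True,
            limit=30, masklist=list(range(500)))
)
```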

Actually running the task is a matter of asyncio; see below.

```python
loop = asyncio.get_event_loop()

try:
    loop.run_until_complete(task)

except KeyboardInterrupt:
    # cancel all pending tasks, let the loop process the cancellations,
    # then stop it cleanly
    asyncio.gather(*asyncio.Task.all_tasks()).cancel()
    loop.stop()
    loop.run_forever()

finally:
    loop.close()
```

To watch the progress of the download, turn on logging for the `pywos` logger.

```python
import logging
logger = logging.getLogger('pywos')
logger.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
logger.addHandler(ch)
```

### analysis part

The `Papers()` class is designed for analysing the papers' metadata. To initialize the object, provide the path of the metadata saved with `WosQuery.main(path)`. One can also provide a list of paths, in which case the data from all of these JSON files is imported. In addition, one can set `merge=True`, so that all files with the prefix `path-` are imported automatically; this is particularly suitable for data files saved with `WosQuery.main(path, savebyeach=True)`.

Generate the citation-analysis table by calling `Papers.show(namelist, maillist, years)`. The first two lists are used to check whether one is the first or corresponding author of a paper, and citations within `years` are counted as recent citations. Set `citedcheck=True` if the data being analysed was obtained from `WosQuery.main(citedcheck=True)`; this adds a further classification of citations by year (recent citations) and by author (citations by others vs. self-citations). The return value of `Papers.show()` is a `pandas.DataFrame`, which can easily be converted to other formats, including CSV, HTML, database tables, and so on.

In sum,

```python
from pywos.analysis import Papers
p = Papers("path-prefix", merge=True)
p.show(["Last, First"], ["mail@server"], ["2018"], citedcheck=True)
```
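
Since the summary is a plain `pandas.DataFrame`, exporting it is standard pandas; for example (the output filename is arbitrary):

```python
df = p.show(["Last, First"], ["mail@server"], ["2018"], citedcheck=True)
df.to_csv("citation_summary.csv")  # save the summary as CSV
html_table = df.to_html()          # or render it as an HTML table
```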


