Python library for sharing word embeddings
Project description
# VecShare: Framework for Sharing Word Embeddings
A Python library for word embedding query, selection and download. Read more about VecShare: https://bit.ly/VecShare.
## Prerequisites:
Before installing this library, install the datadotworld Python library:
```
pip install datadotworld
```
Configure the datadotworld library with your data.world API token.
Your token is obtainable on data.world under [Settings > Advanced](https://data.world/settings/advanced)
Set your data.world token:
```
dw config
```
To avoid resetting the token for every terminal instance, consider adding your token as a global environment variable to your bash profile or permanent environment variables.
## Installation:
Install the VecShare Python library:
```
pip install vecshare
```
See [**Advanced Setup**](#advanced-setup) for details on creating new indexers or signature methods.
## Supported Functions
The VecShare Python library currently supports:
* [`check`](#check-available-embeddings): See available embeddings
* [`format`](#embedding-upload-or-update): Autoformat a header to upload an embedding to the data store
* [`upload`](#embedding-upload-or-update): Upload a new embedding to the datastore
* [`update`](#embedding-upload-or-update): Update an existing embedding or its metadata
* [`query`](#embedding-query): Look up word vectors from a specific embedding
* [`extract`:](#embedding-extraction) Download word vectors for only the vocabulary of a specific corpus
* [`download`](#full-embedding-download): Download an entire shared embedding
### Check Available Embeddings
**`check()`:** Returns embeddings available with the current indexer as a queryable `pandas.DataFrame`.
The default indexer aggregates a set of embeddings by polling `data.world` weekly for datasets with the tag `vecshare`. Currently indexed embeddings are viewable at: `https://data.world/jaredfern/vecshare-indexer`.
See [**Advanced Setup**](#advanced-setup), if you would like to use a custom indexer.
**For Example:**
```python
>>> import vecshare as vs
>>> vs.check()
embedding_name dataset_name contributor \
0 reutersr8 jaredfern/reuters-word-embeddings jaredfern
1 reuters21578 jaredfern/reuters-word-embeddings jaredfern
2 brown jaredfern/brown-corpus jaredfern
3 glove_gigaword100d jaredfern/gigaword-glove-embedding jaredfern
4 oanc_written jaredfern/oanc-word-embeddings jaredfern
case_sensitive dimension embedding_type file_format vocab_size
0 False 100 word2vec csv 7821
1 False 100 word2vec csv 20203
2 False 100 word2vec csv 15062
3 False 100 glove csv 399922
4 False 100 word2vec csv 73127
```
### Embedding Upload or Update
Embeddings must be uploaded as a .csv file with a header in the format: ['text', 'd0', 'd1', ... 'd_n'], such that they can be properly indexed and accessed.
**`format(emb_path)`:** Reformats existing embeddings in the .csv format with a header in the correct format for tabular access.
* **emb_path (str):** Path to the embedding being formatted
**`upload(set_name, emb_path, metadata = {}, summary = None)`:** Create a new shared embedding on data.world
* **set_name (str):** Name of the new dataset on data.world in the form (data.world_username/dataset_name)
* **emb_path (str):** Path to embedding being uploaded
* **metadata (dict, opt):** Dictionary containing metadata fields and values '{metadata_field: value}'
* **summary (str, opt):** Optional embedding description
**`update(set_name, emb_path = "", metadata = {}, summary = "")`:** Update an existing shared embedding or its associated metadata
* **set_name (str):** Name of the new dataset on data.world in the form (data.world_username/dataset_name)
* **emb_path (str):** Path to embedding being uploaded
* **metadata (dict, opt):** Dictionary containing metadata fields and values '{metadata_field: value}'
* **summary (str, opt):** Optional embedding description
Alternatively, new embeddings can be added to the framework by uploading the embedding as a .csv file to data.world, and tagging the dataset with the <vecshare> tag. The default indexer will add new embedding sets weekly.
Metadata associated with the embedding can be added in the datasets description in the following format, `Field: Value`
**For example:**
```
Embedding Type: word2vec
Token Count: 6000000
Case Sensitive: False
```
### Embedding Selection
**`signatures.avgrank(inp_dir)`:** Returns the shared embedding most similar to the user's target corpus, using the AvgRank method described in the VecShare paper. *Note: Computation is performed locally. Users' corpora will not be shared with other users*
* **inp_dir (str):** Path to the directory containing the target corpus.
```python
>>> import vecshare.signatures as sigs
>>> sigs.avgrank('Test_Input')
u'reutersR8
```
Additional custom similarity and selection methods can be added. See ['Advanced Setup'](#advanced-setup).
### Embedding Query
**`query(words, emb_name, set_name = None, case_sensitive = False)`:** Returns a pandas DataFrame, such that each row specifies a word vector from the query.
* **words (list):** List of word vectors being requested
* **emb_name (str):** Title of the embedding containing the requested word vectors
* **set_name (str, opt):** Specify if multiple embeddings exist with the same emb_name
* **case_sensitive (bool):** Set to True if word vectors must exactly case match those in words
** Example:**
```python
>>> import vecshare as vs
>>> vs.query(['The', 'farm'], 'agriculture_40')
text d99 d98 d97 d96 d95 ... d1 d0
0 the -1.414755 0.414973 1.115698 0.034085 0.542921 ... 0.037287 -1.004704
1 farm 0.349535 -0.379208 -0.189476 2.776809 -0.099886 ... 0.067443 -1.391604
[2 rows x 101 columns]
```
### Embedding Extraction
**`def extract(emb_name, file_dir, set_name = None, download = False):`** Return a pandas DataFrame containing all available word vectors for the target corpora's vocabulary.
Parameters:
* **emb_name (str):** Title of the shared embedding
* **file_dir (str):** Directory containing the user's target corpora
* **set_name (str,opt):** Specify only if multiple embeddings exist with the same emb_name
* **download (bool,opt):** If True, the extracted embedding will be saved as a .csv
* **case_sensitive (bool):** Set to True if word vectors must exactly case match those in words
**For example:**
```python
>>> import vecshare as vs
>>> vs.extract('agriculture_40', 'Test_Input/reutersR8_all')
Embedding extraction begins.
100% (23584 of 23584) |################################| Elapsed Time: 0:01:04
Embedding successfully extracted.
text d99 d98 d97 d96 d95 ... \
0 designing -0.194328 -0.229856 0.455848 0.234053 -0.272354 ...
1 affiliated -0.446879 -0.519360 0.130626 0.034608 0.134680 ...
2 appropriately 0.106778 0.057186 -0.222296 0.101948 0.395122 ...
3 cincinnati -0.563716 -0.274534 0.120897 0.273457 0.383307 ...
4 choice 0.689276 1.586349 1.301351 -1.193058 -0.243053 ...
5 han -0.287583 0.237989 -0.141203 0.328414 0.401448 ...
6 begin 1.952841 -1.497073 -0.656650 2.443687 0.315941 ...
7 wednesday -1.591453 -1.419733 -0.758305 2.638620 0.323779 ...
8 wales -0.591623 -0.761353 -0.042557 -0.106776 0.004614 ...
9 much 1.971340 -2.316020 0.147194 -0.641963 -0.280868 ...
d14 d13 d12 d11 d10 d1 d0
0 0.432226 -0.023887 -0.246207 0.429862 0.268280 0.283950 0.218664
1 0.702217 -0.516346 0.273179 0.662874 0.106199 -0.011592 0.057832
2 -0.174151 -0.069734 -0.255887 0.070181 -0.163013 0.093490 0.028913
3 -0.189739 -0.089899 -0.048192 0.569139 0.595834 0.421905 -0.241777
4 -1.085993 -0.054178 1.156616 -1.449286 0.267787 0.677337 2.148856
5 -0.004664 -0.414933 -0.346377 -0.214976 0.201621 0.063539 -0.331673
6 1.587940 -0.258819 1.396479 0.637493 -1.476619 -0.487518 0.864765
7 0.190376 0.881103 0.966915 1.543105 1.974099 -0.807656 0.800163
8 -0.181255 0.005893 -0.718905 0.373082 0.784821 0.393715 -0.000517
9 1.348299 0.180225 1.686486 0.535154 -2.005099 -1.424234 -2.677770
[9320 rows x 101 columns]
```
### Full Embedding Download
**`download(emb_name, set_name=None):`** Returns the full embedding, containing all uploaded word vectors in the shared embedding and saves the embedding as a .csv file in the current directory
* **emb_name (str):** Title of the shared embedding
* **set_name (str, opt):** Specify if multiple embeddings exist with the same emb_name
**For example:**
```python
>>> import vecshare as vs
>>> vs.download('agriculture_40')
text d0 d1 d2 d3 d4 \
0 the 1.477964 0.016078 -0.193995 1.113142 0.765398
1 of -0.048878 -0.597735 0.196982 0.220966 1.463818
2 to 1.932197 1.587676 -0.321938 -0.592603 0.137684
3 in 0.294486 1.061131 -0.119670 0.611166 0.436337
4 said -0.609932 -0.481854 0.028189 0.755433 -0.493351
5 a 0.750953 0.342545 -0.758257 0.381944 0.824879
6 and 0.991821 -0.252496 0.011951 0.384948 0.505785
7 mln 0.215208 3.330005 0.458480 0.484309 1.128098
8 vs 0.512198 3.565070 -1.698517 0.813855 -0.002396
9 dlrs -0.026384 1.905773 1.313683 0.825797 1.981671
```
## Advanced Setup
### Custom Signature Methods:
Additional signature methods can be included in the library by downloading the library and adding to the `signatures.py` file. To incorporate new signatures into future releases of the official VecShare library, fork and merge your changes with the github repository.
### Custom Indexers:
Custom indexers can be added by updating the `indexer.py` file.
```python
INDEXER = <NEW INDEXER DATASET ID>
INDEX_FILE = <NAME OF THE INDEX FILE>
EMB_TAG = <EMB TAG>
```
A Python library for word embedding query, selection and download. Read more about VecShare: https://bit.ly/VecShare.
## Prerequisites:
Before installing this library, install the datadotworld Python library:
```
pip install datadotworld
```
Configure the datadotworld library with your data.world API token.
Your token is obtainable on data.world under [Settings > Advanced](https://data.world/settings/advanced)
Set your data.world token:
```
dw config
```
To avoid resetting the token for every terminal instance, consider adding your token as a global environment variable to your bash profile or permanent environment variables.
## Installation:
Install the VecShare Python library:
```
pip install vecshare
```
See [**Advanced Setup**](#advanced-setup) for details on creating new indexers or signature methods.
## Supported Functions
The VecShare Python library currently supports:
* [`check`](#check-available-embeddings): See available embeddings
* [`format`](#embedding-upload-or-update): Autoformat a header to upload an embedding to the data store
* [`upload`](#embedding-upload-or-update): Upload a new embedding to the datastore
* [`update`](#embedding-upload-or-update): Update an existing embedding or its metadata
* [`query`](#embedding-query): Look up word vectors from a specific embedding
* [`extract`:](#embedding-extraction) Download word vectors for only the vocabulary of a specific corpus
* [`download`](#full-embedding-download): Download an entire shared embedding
### Check Available Embeddings
**`check()`:** Returns embeddings available with the current indexer as a queryable `pandas.DataFrame`.
The default indexer aggregates a set of embeddings by polling `data.world` weekly for datasets with the tag `vecshare`. Currently indexed embeddings are viewable at: `https://data.world/jaredfern/vecshare-indexer`.
See [**Advanced Setup**](#advanced-setup), if you would like to use a custom indexer.
**For Example:**
```python
>>> import vecshare as vs
>>> vs.check()
embedding_name dataset_name contributor \
0 reutersr8 jaredfern/reuters-word-embeddings jaredfern
1 reuters21578 jaredfern/reuters-word-embeddings jaredfern
2 brown jaredfern/brown-corpus jaredfern
3 glove_gigaword100d jaredfern/gigaword-glove-embedding jaredfern
4 oanc_written jaredfern/oanc-word-embeddings jaredfern
case_sensitive dimension embedding_type file_format vocab_size
0 False 100 word2vec csv 7821
1 False 100 word2vec csv 20203
2 False 100 word2vec csv 15062
3 False 100 glove csv 399922
4 False 100 word2vec csv 73127
```
### Embedding Upload or Update
Embeddings must be uploaded as a .csv file with a header in the format: ['text', 'd0', 'd1', ... 'd_n'], such that they can be properly indexed and accessed.
**`format(emb_path)`:** Reformats existing embeddings in the .csv format with a header in the correct format for tabular access.
* **emb_path (str):** Path to the embedding being formatted
**`upload(set_name, emb_path, metadata = {}, summary = None)`:** Create a new shared embedding on data.world
* **set_name (str):** Name of the new dataset on data.world in the form (data.world_username/dataset_name)
* **emb_path (str):** Path to embedding being uploaded
* **metadata (dict, opt):** Dictionary containing metadata fields and values '{metadata_field: value}'
* **summary (str, opt):** Optional embedding description
**`update(set_name, emb_path = "", metadata = {}, summary = "")`:** Update an existing shared embedding or its associated metadata
* **set_name (str):** Name of the new dataset on data.world in the form (data.world_username/dataset_name)
* **emb_path (str):** Path to embedding being uploaded
* **metadata (dict, opt):** Dictionary containing metadata fields and values '{metadata_field: value}'
* **summary (str, opt):** Optional embedding description
Alternatively, new embeddings can be added to the framework by uploading the embedding as a .csv file to data.world, and tagging the dataset with the <vecshare> tag. The default indexer will add new embedding sets weekly.
Metadata associated with the embedding can be added in the datasets description in the following format, `Field: Value`
**For example:**
```
Embedding Type: word2vec
Token Count: 6000000
Case Sensitive: False
```
### Embedding Selection
**`signatures.avgrank(inp_dir)`:** Returns the shared embedding most similar to the user's target corpus, using the AvgRank method described in the VecShare paper. *Note: Computation is performed locally. Users' corpora will not be shared with other users*
* **inp_dir (str):** Path to the directory containing the target corpus.
```python
>>> import vecshare.signatures as sigs
>>> sigs.avgrank('Test_Input')
u'reutersR8
```
Additional custom similarity and selection methods can be added. See ['Advanced Setup'](#advanced-setup).
### Embedding Query
**`query(words, emb_name, set_name = None, case_sensitive = False)`:** Returns a pandas DataFrame, such that each row specifies a word vector from the query.
* **words (list):** List of word vectors being requested
* **emb_name (str):** Title of the embedding containing the requested word vectors
* **set_name (str, opt):** Specify if multiple embeddings exist with the same emb_name
* **case_sensitive (bool):** Set to True if word vectors must exactly case match those in words
** Example:**
```python
>>> import vecshare as vs
>>> vs.query(['The', 'farm'], 'agriculture_40')
text d99 d98 d97 d96 d95 ... d1 d0
0 the -1.414755 0.414973 1.115698 0.034085 0.542921 ... 0.037287 -1.004704
1 farm 0.349535 -0.379208 -0.189476 2.776809 -0.099886 ... 0.067443 -1.391604
[2 rows x 101 columns]
```
### Embedding Extraction
**`def extract(emb_name, file_dir, set_name = None, download = False):`** Return a pandas DataFrame containing all available word vectors for the target corpora's vocabulary.
Parameters:
* **emb_name (str):** Title of the shared embedding
* **file_dir (str):** Directory containing the user's target corpora
* **set_name (str,opt):** Specify only if multiple embeddings exist with the same emb_name
* **download (bool,opt):** If True, the extracted embedding will be saved as a .csv
* **case_sensitive (bool):** Set to True if word vectors must exactly case match those in words
**For example:**
```python
>>> import vecshare as vs
>>> vs.extract('agriculture_40', 'Test_Input/reutersR8_all')
Embedding extraction begins.
100% (23584 of 23584) |################################| Elapsed Time: 0:01:04
Embedding successfully extracted.
text d99 d98 d97 d96 d95 ... \
0 designing -0.194328 -0.229856 0.455848 0.234053 -0.272354 ...
1 affiliated -0.446879 -0.519360 0.130626 0.034608 0.134680 ...
2 appropriately 0.106778 0.057186 -0.222296 0.101948 0.395122 ...
3 cincinnati -0.563716 -0.274534 0.120897 0.273457 0.383307 ...
4 choice 0.689276 1.586349 1.301351 -1.193058 -0.243053 ...
5 han -0.287583 0.237989 -0.141203 0.328414 0.401448 ...
6 begin 1.952841 -1.497073 -0.656650 2.443687 0.315941 ...
7 wednesday -1.591453 -1.419733 -0.758305 2.638620 0.323779 ...
8 wales -0.591623 -0.761353 -0.042557 -0.106776 0.004614 ...
9 much 1.971340 -2.316020 0.147194 -0.641963 -0.280868 ...
d14 d13 d12 d11 d10 d1 d0
0 0.432226 -0.023887 -0.246207 0.429862 0.268280 0.283950 0.218664
1 0.702217 -0.516346 0.273179 0.662874 0.106199 -0.011592 0.057832
2 -0.174151 -0.069734 -0.255887 0.070181 -0.163013 0.093490 0.028913
3 -0.189739 -0.089899 -0.048192 0.569139 0.595834 0.421905 -0.241777
4 -1.085993 -0.054178 1.156616 -1.449286 0.267787 0.677337 2.148856
5 -0.004664 -0.414933 -0.346377 -0.214976 0.201621 0.063539 -0.331673
6 1.587940 -0.258819 1.396479 0.637493 -1.476619 -0.487518 0.864765
7 0.190376 0.881103 0.966915 1.543105 1.974099 -0.807656 0.800163
8 -0.181255 0.005893 -0.718905 0.373082 0.784821 0.393715 -0.000517
9 1.348299 0.180225 1.686486 0.535154 -2.005099 -1.424234 -2.677770
[9320 rows x 101 columns]
```
### Full Embedding Download
**`download(emb_name, set_name=None):`** Returns the full embedding, containing all uploaded word vectors in the shared embedding and saves the embedding as a .csv file in the current directory
* **emb_name (str):** Title of the shared embedding
* **set_name (str, opt):** Specify if multiple embeddings exist with the same emb_name
**For example:**
```python
>>> import vecshare as vs
>>> vs.download('agriculture_40')
text d0 d1 d2 d3 d4 \
0 the 1.477964 0.016078 -0.193995 1.113142 0.765398
1 of -0.048878 -0.597735 0.196982 0.220966 1.463818
2 to 1.932197 1.587676 -0.321938 -0.592603 0.137684
3 in 0.294486 1.061131 -0.119670 0.611166 0.436337
4 said -0.609932 -0.481854 0.028189 0.755433 -0.493351
5 a 0.750953 0.342545 -0.758257 0.381944 0.824879
6 and 0.991821 -0.252496 0.011951 0.384948 0.505785
7 mln 0.215208 3.330005 0.458480 0.484309 1.128098
8 vs 0.512198 3.565070 -1.698517 0.813855 -0.002396
9 dlrs -0.026384 1.905773 1.313683 0.825797 1.981671
```
## Advanced Setup
### Custom Signature Methods:
Additional signature methods can be included in the library by downloading the library and adding to the `signatures.py` file. To incorporate new signatures into future releases of the official VecShare library, fork and merge your changes with the github repository.
### Custom Indexers:
Custom indexers can be added by updating the `indexer.py` file.
```python
INDEXER = <NEW INDEXER DATASET ID>
INDEX_FILE = <NAME OF THE INDEX FILE>
EMB_TAG = <EMB TAG>
```
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
vecshare-1.0.5.tar.gz
(15.6 kB
view hashes)