SpeakLeash agnostic dataset for Polish
Project description
UPDATE 05.05.2024:
Due to hosting changes, upgrading the package to the latest version is recommended:
pip install --upgrade speakleash
SpeakLeash is a lightweight library providing datasets for the Polish language and tools to make them useful:
- Website: https://speakleash.org/
- Datasets: https://speakleash.org/dashboard/
- Source code: https://github.com/speakleash/speakleash
- Data in action: https://github.com/speakleash/speakleash-examples
- Bug reports: https://github.com/speakleash/speakleash/issues
Installation
The SpeakLeash package is available on PyPI and should be installed in a virtual environment:
pip install speakleash
Basic Usage
If you just want to see the details of the datasets:
from speakleash import Speakleash
import os
base_dir = os.path.join(os.path.dirname(__file__))
replicate_to = os.path.join(base_dir, "datasets")
sl = Speakleash(replicate_to)
for d in sl.datasets:
    size_mb = round(d.characters/1024/1024)
    print("Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}".format(d.name, size_mb, d.characters, d.documents))
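The listing above can be turned into a sorted summary. The following is only a sketch: `summarize` is a hypothetical helper, not part of the library, and it relies solely on the `name`, `characters`, and `documents` properties used above. Stand-in objects replace `sl.datasets` so the example runs without downloading anything.

```python
from types import SimpleNamespace


def summarize(datasets):
    """Return (name, size_mb, documents) tuples, largest character count first."""
    ordered = sorted(datasets, key=lambda d: d.characters, reverse=True)
    return [(d.name, round(d.characters / 1024 / 1024), d.documents) for d in ordered]


# Stand-in objects (hypothetical names and numbers) exposing the same properties.
fake_datasets = [
    SimpleNamespace(name="small_set", characters=2 * 1024 * 1024, documents=10),
    SimpleNamespace(name="large_set", characters=50 * 1024 * 1024, documents=400),
]

summary = summarize(fake_datasets)
for name, size_mb, documents in summary:
    print("Dataset: {0}, size: {1} MB, documents: {2}".format(name, size_mb, documents))
```

With the real library, the same helper would presumably be called as `summarize(sl.datasets)`.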
You can use individual properties (e.g. characters, documents), or display the entire manifest:
sl = Speakleash(replicate_to)
print(sl.get("plwiki").manifest)
Once you select a dataset with .get(<dataset name>), its data property lets you iterate over the text of every document:
from speakleash import Speakleash
import os
base_dir = os.path.join(os.path.dirname(__file__))
replicate_to = os.path.join(base_dir, "datasets")
sl = Speakleash(replicate_to)
wiki = sl.get("plwiki").data
for doc in wiki:
    print(doc[:40])
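Because a dataset can contain a lot of text, it may be convenient to look at only the first few documents. The sketch below assumes that `.data` behaves like an ordinary iterable of strings, as the loop above suggests. `preview` is a hypothetical helper, and a generator stands in for the real dataset.

```python
from itertools import islice


def preview(docs, n=3, width=40):
    """Return the first `width` characters of the first `n` documents."""
    # islice stops after n items, so the rest of the iterable is never consumed.
    return [doc[:width] for doc in islice(docs, n)]


def fake_data():
    """Stand-in for sl.get("plwiki").data: yields plain strings."""
    for i in range(1000):
        yield "Dokument numer {0}: ".format(i) + "tekst " * 20


previews = preview(fake_data(), n=3, width=20)
for p in previews:
    print(p)
```

With the real library, this would presumably be `preview(sl.get("plwiki").data)`.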
If you also need metadata, use the ext_data property, which yields (text, metadata) pairs:
ds = sl.get("plwiki").ext_data
for doc in ds:
    print(doc)
    txt, meta = doc
    print(meta.get("title"))
    print(txt)
Commonly available metadata fields:
- title
- length
- sentences
- words
- verbs
- nouns
- symbols
- punctuations
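The metadata fields listed above can be used to select documents. The following is a minimal sketch, assuming each item from `ext_data` is a `(txt, meta)` pair in which `meta` is a dict and `words` holds a numeric count. Neither the value types nor the presence of every field is confirmed here. `filter_by_min_words` is a hypothetical helper, shown on stand-in data.

```python
from typing import Dict, Iterable, Iterator, Tuple

Document = Tuple[str, Dict]


def filter_by_min_words(docs: Iterable[Document], min_words: int) -> Iterator[Document]:
    """Yield (text, meta) pairs whose 'words' metadata is at least min_words."""
    for txt, meta in docs:
        # Assumption: a missing or non-numeric count means the document is skipped.
        words = meta.get("words", 0)
        if isinstance(words, (int, float)) and words >= min_words:
            yield txt, meta


# Stand-in data shaped like the (txt, meta) pairs returned by ext_data.
sample_docs = [
    ("Krótki tekst.", {"title": "A", "words": 2}),
    ("To jest nieco dłuższy przykładowy dokument.", {"title": "B", "words": 6}),
    ("Brak licznika słów.", {"title": "C"}),
]

long_docs = list(filter_by_min_words(sample_docs, min_words=5))
print([meta["title"] for _, meta in long_docs])
```

With the real library, the call would presumably look like `filter_by_min_words(sl.get("plwiki").ext_data, 100)`.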
Hashes for speakleash-0.3.51-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 2c469f7f2a7466f4807b05b13beb7c34453b82f0556dd1cf7eee23061770beb8
MD5 | 919cbb3404480056c481ecd484ceba60
BLAKE2b-256 | 62d7b5891b38e57b89d6abe3d0938599b01adb0a357b6a5472280cef4b503e8d