SpeakLeash agnostic dataset for Polish
Project description
UPDATE 05.05.2024:
Due to hosting-related changes, we recommend upgrading the package to the latest version with:
pip install --upgrade speakleash
SpeakLeash is a lightweight library providing datasets for the Polish language and tools to make them useful.
- Website: https://speakleash.org/
- Datasets: https://speakleash.org/dashboard/
- Source code: https://github.com/speakleash/speakleash
- Data in action: https://github.com/speakleash/speakleash-examples
- Bug reports: https://github.com/speakleash/speakleash/issues
Installation
The speakleash package is available on PyPI and should be installed in a virtual environment:
pip install speakleash
Basic Usage
If you just want to see the details of the datasets:
from speakleash import Speakleash
import os

# Local folder, next to this script, passed to Speakleash as the replication directory.
base_dir = os.path.dirname(__file__)
replicate_to = os.path.join(base_dir, "datasets")
sl = Speakleash(replicate_to)

for d in sl.datasets:
    # Rough size estimate: character count divided by 1024 * 1024.
    size_mb = round(d.characters/1024/1024)
    print("Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}".format(d.name, size_mb, d.characters, d.documents))
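The listing above can be extended to rank datasets by size. The sketch below relies only on the attributes already used above (name, characters, documents). DatasetInfo and summarize are hypothetical names, and the sample numbers are made up so the example runs without downloading anything; with the real library you would pass sl.datasets instead. Note that the "MB" figure is a character count scaled by 1024 * 1024, so it only approximates the size in bytes, since Polish characters can take more than one byte in UTF-8.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class DatasetInfo:
    """Hypothetical stand-in exposing the attributes used in the listing above."""
    name: str
    characters: int
    documents: int


def summarize(datasets: Iterable, top_n: int = 3) -> List[str]:
    """Return one formatted line per dataset, largest first."""
    ranked = sorted(datasets, key=lambda d: d.characters, reverse=True)
    lines = []
    for d in ranked[:top_n]:
        size_mb = round(d.characters / 1024 / 1024)
        lines.append(
            f"Dataset: {d.name}, size: {size_mb} MB, "
            f"characters: {d.characters}, documents: {d.documents}"
        )
    return lines


# Made-up numbers for illustration only; not real SpeakLeash statistics.
sample = [
    DatasetInfo("small_set", 2 * 1024 * 1024, 100),
    DatasetInfo("big_set", 50 * 1024 * 1024, 4000),
]
for line in summarize(sample):
    print(line)
```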
You can read individual properties (e.g. characters, documents), or display the entire manifest:
sl = Speakleash(replicate_to)
print(sl.get("plwiki").manifest)
Once you pick a dataset with .get(<dataset name>), its data property gives you a lot of text data ;-)
from speakleash import Speakleash
import os

base_dir = os.path.dirname(__file__)
replicate_to = os.path.join(base_dir, "datasets")
sl = Speakleash(replicate_to)

wiki = sl.get("plwiki").data
for doc in wiki:
    # Print only the first 40 characters of each document.
    print(doc[:40])
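Because a dataset can be large, it is often handy to look at only the first few documents. A minimal sketch, assuming .data is an ordinary iterable of strings, as the loop above suggests; preview is a hypothetical helper, and the generated strings stand in for sl.get("plwiki").data.

```python
from itertools import islice
from typing import Iterable, List


def preview(docs: Iterable[str], limit: int = 3, width: int = 40) -> List[str]:
    """Return the first `width` characters of the first `limit` documents."""
    # islice stops after `limit` items, so the rest of the iterable is never read.
    return [doc[:width] for doc in islice(docs, limit)]


# Stand-in documents; with the real library pass sl.get("plwiki").data instead.
fake_docs = (f"Dokument numer {i} " * 5 for i in range(1000))
for snippet in preview(fake_docs):
    print(snippet)
```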
If you also need metadata, use the ext_data property:
ds = sl.get("plwiki").ext_data
for doc in ds:
    print(doc)
    # Each item is a (text, metadata) pair.
    txt, meta = doc
    print(meta.get("title"))
    print(txt)
Popular metadata fields:
- title
- length
- sentences
- words
- verbs
- nouns
- symbols
- punctuations
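These fields can be used to filter documents while iterating. A sketch, assuming each item of ext_data is a (text, meta) pair as in the loop above and that meta behaves like a dict whose "words" value is a number; filter_by_words is a hypothetical helper, and the sample pairs are made up.

```python
from typing import Dict, Iterable, Iterator, Tuple

Doc = Tuple[str, Dict]


def filter_by_words(docs: Iterable[Doc], min_words: int) -> Iterator[Doc]:
    """Yield (text, meta) pairs whose meta reports at least `min_words` words."""
    for txt, meta in docs:
        # A missing or null "words" value is treated as 0.
        if (meta.get("words") or 0) >= min_words:
            yield txt, meta


# Made-up documents; with the real library pass sl.get("plwiki").ext_data instead.
sample = [
    ("Krótki tekst.", {"title": "A", "words": 2}),
    ("To jest nieco dłuższy tekst przykładowy.", {"title": "B", "words": 6}),
    ("Brak metadanych.", {"title": "C"}),
]
for txt, meta in filter_by_words(sample, min_words=5):
    print(meta.get("title"), "-", txt)
```

Writing the helper as a generator keeps memory use flat, since documents are checked one at a time instead of being collected into a list first.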
Hashes for speakleash-0.3.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | e74a0b468a7aed767ac7d2eed8d0dfde68971d2f0cc749f7a617300853025551
MD5 | 6a300ea6b9ff8d5bfe9c0a4b8d8b1437
BLAKE2b-256 | c20eb961442c7159d21ee3ade98af5735fff1d444a2389cd4fc46d3eb0f22a3c