Skip to main content

SpeakLeash agnostic dataset for Polish

Project description

SpeakLeash

SpeakLeash agnostic dataset for Polish

Basic Usage

If you just want to see the details of the datasets

from speakleash import Speakleash
import os

base_dir = os.path.join(os.path.dirname(__file__))
replicate_to = os.path.join(base_dir, "datasets")

sl = Speakleash(replicate_to)

for d in sl.datasets:
    print(d.name)
    for doc in d.data:
        size_mb = round(d.characters/1024/1024)
        print("Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}".format(d.name, size_mb, d.characters, d.documents))

You can use individual properties (e.g.:characters, documents), but you can display the entire manifest

sl = Speakleash(replicate_to)
print(sl.get("plwiki").manifest)

If you chose one of them (.get(name of dataset)) then you will get a lot of text data ;-)

from speakleash import Speakleash
import os

base_dir = os.path.join(os.path.dirname(__file__))
replicate_to = os.path.join(base_dir, "datasets")

sl = Speakleash(replicate_to)

wiki = sl.get("plwiki").data
for doc in wiki:
    print(doc[:40])

If you also need meta data then use the ext_data property


ds = sl.get("plwiki").ext_data
for doc in ds:
    print(doc)
    txt, meta = doc
    print(meta.get("title"))
    print(txt)


Popular meta data:

  • title
  • length
  • sentences
  • words
  • verbs
  • nouns
  • symbols
  • punctuations

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

speakleash-0.0.11.tar.gz (3.7 kB view hashes)

Uploaded Source

Built Distribution

speakleash-0.0.11-py3-none-any.whl (4.0 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page