lachesis automates the segmentation of a transcript into closed captions

pip install lachesis

TODO: add directions about installing model files and Python NLP libraries.


Tokenize a text, split it into sentences, and POS-tag it:

from lachesis.elements import Text
from lachesis.nlpwrappers import NLPEngine

# work on this Unicode string
s = u"Hello, World. This is a second sentence, with a comma too! And a third sentence."

# but you can also pass a list with pre-split text
# s = [u"Hello World.", u"This is a second sentence.", u"Third one, bla bla"]

# create a Text object from the Unicode string
t = Text(s, language=u"eng")

# tokenize, split sentences, and POS-tag the text;
# the best available NLP library is chosen
# according to the language of the text
nlp1 = NLPEngine()
nlp1.analyze(t)
for sentence in t.sentences:
    # print each sentence found by the analysis
    print(sentence)

# explicitly specify an NLP library
# in this case, use "nltk"
# (other options include: "pattern", "spacy", "udpipe")
nlp2 = NLPEngine()
nlp2.analyze(t, wrapper="nltk")

# preload NLP libraries
nlp3 = NLPEngine(preload=[
    ("eng", "spacy"),
    ("deu", "nltk"),
    ("ita", "pattern"),
    ("fra", "udpipe")
])

Download closed captions from YouTube or parse an existing TTML file:

from lachesis.downloaders import Downloader

# URL of the video
url = u""

# download English automatic CC, storing the raw TTML file in /tmp/
language = u"en"
options = { "auto": True, "output_file_path": "/tmp/auto.ttml" }
ccl = Downloader.download_closed_captions(url, language, options)

# download English manual CC
language = u"en"
options = { "auto": False }
ccl = Downloader.download_closed_captions(url, language, options)

# parse a given TTML file (downloaded from YouTube)
ifp = "/tmp/auto.ttml"
ccl = Downloader.read_closed_captions(ifp, options={u"downloader": u"youtube"})

# get various representations of the CCs
print(ccl.single_string)        # print as a single string, collapsing CCs and lines
print(ccl.plain_string)         # print as a plain string, one CC per row and collapsed lines
print(ccl.cc_string)            # print as blank-separated, multiple line, SRT-like string
                                # (but without timings and ids)
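
The three representations above differ only in how captions and their lines are joined. The following is a minimal sketch of that difference in plain Python, not lachesis's own implementation: it assumes a hypothetical data model in which each closed caption is a list of text lines, and the function names merely mirror the attribute names used above.

```python
# Hypothetical data model (an assumption, not the lachesis API):
# each closed caption (CC) is a list of text lines.
captions = [
    ["Hello, World.", "This is a second sentence,"],
    ["with a comma too!"],
    ["And a third sentence."],
]


def single_string(ccs):
    """Collapse all CCs and all lines into one space-separated string."""
    return " ".join(line for cc in ccs for line in cc)


def plain_string(ccs):
    """One CC per row; the lines of each CC are collapsed with a space."""
    return "\n".join(" ".join(cc) for cc in ccs)


def cc_string(ccs):
    """Blank-line-separated CCs that keep their line breaks (SRT-like, no timings or ids)."""
    return "\n\n".join("\n".join(cc) for cc in ccs)


if __name__ == "__main__":
    print(single_string(captions))
    print("---")
    print(plain_string(captions))
    print("---")
    print(cc_string(captions))
```

In this sketch the first form loses all caption boundaries, the second keeps caption boundaries but not line breaks, and the third keeps both.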


lachesis is released under the terms of the GNU Affero General Public License Version 3. See the LICENSE file for details.
