lachesis automates the segmentation of a transcript into closed captions
Project description
lachesis automates the segmentation of a transcript into closed captions
Version: 0.0.3
Date: 2017-01-26
Developed by: Alberto Pettarin
License: the GNU Affero General Public License Version 3 (AGPL v3)
Contact: info@readbeyond.it
DO NOT USE THIS PACKAGE IN PRODUCTION UNTIL IT REACHES v1.0.0 !!!
Goal
lachesis automates the segmentation of a transcript into closed captions (CCs).
The general idea is that writing a transcription (raw text) is easier and faster than writing CCs, especially if you need to respect constraints like a certain minimum/maximum number of characters per line, a maximum number of lines per CC, etc.
You can transcribe your video into raw text and lachesis will take on the job of segmenting the text into CCs for you. Once you have the CCs, you can use a forced aligner like aeneas to align them with the audio of your video, obtaining a subtitle file (SRT, TTML, VTT, etc.).
With lachesis and a forced aligner, the manual labor for producing CCs for a video is reduced to a. transcribing the video in raw text form, and b. checking the final CCs and audio alignment. Instead of transcribing from scratch, you can even start by checking/editing a rough transcription made by an automated speech recognition engine, like the “automatic CCs” from YouTube, speeding the process up further.
The “magic” behind lachesis consists in combining machine learning techniques like conditional random fields (CRF) and classical NLP tools like POS tagging and sentence segmentation to split the text into CC lines. The machine learning models are learned from existing, manually-edited, high-quality CCs, like those of TED/TEDx talks on YouTube. The NLP tools come from the well-established, free NLP libraries for Python listed below.
In summary, lachesis contains the following major functions:
download closed captions from YouTube;
parse closed caption TTML files (downloaded from YouTube);
add POS tags to a given text or closed caption file;
segment a given text into sentences;
segment a given text into closed captions (several algorithms are available);
train and use machine learning models to segment raw text into CC lines.
Installation
DO NOT USE THIS PACKAGE IN PRODUCTION UNTIL IT REACHES v1.0.0 !!!
pip install lachesis
Installing dependencies
You might need additional packages, depending on how you plan to use lachesis:
lxml >= 3.6.0 for reading or downloading TTML files;
youtube-dl >= 2017.1.16 for downloading TTML files;
python-crfsuite >= 0.9.1 for training and using CRF-based splitters.
By design choice, none of the above dependencies is installed by pip install lachesis. If you want to install them all, you can use:
pip install lachesis[full]
Alternatively, manually install only the dependencies you need. (You can do it before or after installing lachesis, the order does not matter.)
Installing NLP Libraries
In addition to the dependencies listed above, to perform POS tagging and sentence segmentation lachesis can use one or more of the following libraries:
Pattern (install with pip install pattern, see here)
NLTK (install with pip install nltk, see here)
spaCy (install with pip install spacy, see here)
UDPipe (install with pip install ufal.udpipe, see here)
If you want to install them all, you can use:
pip install lachesis[nlp]
or [fullnlp] if you also want [full] as above.
Each NLP library also needs language models which you need to download/install separately. Consult the documentation of your NLP library for details.
lachesis expects the following directories in your home directory (you can symlink them, if you installed each NLP library in a different place):
~/lachesis_data/nltk_data for NLTK (see here);
~/lachesis_data/spacy_data for spaCy (see here);
~/lachesis_data/udpipe_data for UDPipe (see here).
The NLP library Pattern does not need a separate download of its language models, as they are bundled in the file you download when installing through pip install pattern.
The following table summarizes the languages supported by each library in their standard language models pack. (Additional languages might be supported by third party projects/downloads or added over time.)
Language / Library |
Pattern |
NLTK |
spaCy |
UDPipe |
---|---|---|---|---|
Arabic |
✓ |
|||
Basque |
✓ |
|||
Bulgarian |
✓ |
|||
Croatian |
✓ |
|||
Czech |
✓ |
✓ |
||
Danish |
✓ |
✓ |
||
Dutch |
✓ |
✓ |
✓ |
|
English |
✓ |
✓ |
✓ |
✓ |
Estonian |
✓ |
✓ |
||
Finnish |
✓ |
✓ |
||
French |
✓ |
✓ |
✓ |
|
German |
✓ |
✓ |
✓ |
✓ |
Gothic |
✓ |
|||
Greek |
✓ |
✓ |
||
Greek (ancient) |
✓ |
|||
Hebrew |
✓ |
|||
Hindi |
✓ |
|||
Hungarian |
✓ |
|||
Indonesian |
✓ |
|||
Irish |
✓ |
|||
Italian |
✓ |
✓ |
✓ |
|
Latin |
✓ |
|||
Norwegian |
✓ |
✓ |
||
Old Church Slavonic |
✓ |
|||
Persian |
✓ |
|||
Polish |
✓ |
✓ |
||
Portuguese |
✓ |
✓ |
||
Romanian |
✓ |
|||
Slovenian |
✓ |
✓ |
||
Spanish |
✓ |
✓ |
✓ |
|
Swedish |
✓ |
✓ |
||
Tamil |
✓ |
|||
Turkish |
✓ |
Usage
Parse an existing TTML file downloaded from YouTube
from lachesis.downloaders import Downloader
# parse a given TTML file downloaded from YouTube
ifp = "/tmp/ccs.ttml"
doc = Downloader.read_closed_captions(ifp, options={u"downloader": u"youtube"})
print(doc.language)
# print several representations of the CCs
print(doc.raw_string) # multi line string, similar to SRT but w/o ids or times
print(doc.raw_flat_clean_string) # single line string, w/o CC line marks
print(doc.raw.string(flat=True, eol=u"|")) # single line string, CC lines separated by '|' characters
Tokenize, split sentences, and POS tagging
from lachesis.elements import Document
from lachesis.language import Language
from lachesis.nlpwrappers import NLPEngine
# work on this Unicode string
s = u"Hello, World. This is a second sentence, with a comma too! And a third sentence."
# but you can also pass a list with pre-split sentences
# s = [u"Hello World.", u"This is a second sentence.", u"Third one, bla bla"]
# create a Text object from the Unicode string
doc = Document(raw=s, language=Language.ENGLISH)
# tokenize, split sentences, and POS tagging
# the best available NLP library will be chosen
nlp1 = NLPEngine()
nlp1.analyze(doc)
# the text has been divided into tokens, grouped in sentences
for s in doc.sentences:
print(s) # raw
print(s.string(tagged=True)) # tagged
print(s.string(raw=True, eol=u"|", eos=u"")) # raw w/o CC line and sentence marks
# explicitly specify the NLP library NLTK,
# other options include: "pattern", "spacy", "udpipe"
nlp2 = NLPEngine()
nlp2.analyze(doc, wrapper=u"nltk")
...
# if you need to analyze many documents,
# preload (and keep in cache) an NLP library,
# even different ones for different languages
nlp3 = NLPEngine(preload=[
(u"en", u"spacy"),
(u"de", u"nltk"),
(u"it", u"pattern"),
(u"fr", u"udpipe")
])
nlp3.analyze(doc)
...
Train a CRF model to segment raw text into CC lines
$ # /tmp/ccs/train contains several TTML files to learn from
$ # you can download them from YouTube using lachesis (see above)
$ ls /tmp/ccs/train
0001.ttml
0002.ttml
...
$ # extract features and labels from them:
$ python -m lachesis.ml.crf dump eng /tmp/ccs/train/ /tmp/ccs/train.pickle
...
$ # train the CRF model:
$ python -m lachesis.ml.crf train eng /tmp/ccs/train.pickle /tmp/ccs/model.crfsuite
...
$ # evaluate the model on the training set
$ python -m lachesis.ml.crf test eng /tmp/ccs/train.pickle /tmp/ccs/model.crfsuite
...
$ # you might want to evaluate on a test set, disjoint from the training set,
$ # that is, the test set contains CCs not seen during the training:
$ ls /tmp/css/test
1001.ttml
1002.ttml
...
$ python -m lachesis.ml.crf dump eng /tmp/ccs/test/ /tmp/ccs/test.pickle
$ python -m lachesis.ml.crf test eng /tmp/ccs/test.pickle /tmp/ccs/model.crfsuite
...
$ # now you can build a CRFSplitter
$ # with model_file_path="/tmp/ccs/model.crfsuite" as shown above
TODO: decide and document where pre-trained model files can be downloaded
License
lachesis is released under the terms of the GNU Affero General Public License Version 3. See the LICENSE file for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.