lachesis automates the segmentation of a transcript into closed captions

DO NOT USE THIS PACKAGE IN PRODUCTION UNTIL IT REACHES v1.0.0 !!!

## Goal

lachesis automates the segmentation of a transcript into closed captions (CCs).

The general idea is that writing a transcription (raw text) is easier and faster than writing CCs, especially if you need to respect constraints like a certain minimum/maximum number of characters per line, a maximum number of lines per CC, etc.
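To make these constraints concrete, here is a minimal sketch of a checker for a single CC. The function name `check_caption`, its parameters, and the default limits are hypothetical and are not part of the lachesis API; the limits of 42 characters per line and 2 lines per CC simply mirror the values used in the usage examples of this README.

```python
def check_caption(lines, max_chars_per_line=42, max_lines=2, min_chars_per_line=1):
    """Return a list of human-readable constraint violations for one CC.

    `lines` is a list of strings, one per CC line.
    An empty list means that the CC respects all constraints.
    """
    problems = []
    if len(lines) > max_lines:
        problems.append("too many lines: %d > %d" % (len(lines), max_lines))
    for number, line in enumerate(lines, start=1):
        if len(line) > max_chars_per_line:
            problems.append(
                "line %d too long: %d > %d" % (number, len(line), max_chars_per_line)
            )
        if len(line) < min_chars_per_line:
            problems.append(
                "line %d too short: %d < %d" % (number, len(line), min_chars_per_line)
            )
    return problems


if __name__ == "__main__":
    print(check_caption(["Hello, World.", "This is a second sentence,"]))
    print(check_caption(["x" * 50, "second", "third"]))
```

Checking every CC by hand against limits like these is the tedious part that an automated segmenter is meant to take over.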

You can transcribe your video into raw text and lachesis will take on the job of segmenting the text into CCs for you. Once you have the CCs, you can use a forced aligner like aeneas to align them with the audio of your video, obtaining a subtitle file (SRT, TTML, VTT, etc.).

With lachesis and a forced aligner, the manual labor of producing CCs for a video is reduced to (a) transcribing the video as raw text and (b) checking the final CCs and their audio alignment. Instead of transcribing from scratch, you can start from a rough transcription made by an automatic speech recognition engine, such as the “automatic CCs” from YouTube, and edit it, which speeds the process up further.

The “magic” behind lachesis consists of combining machine learning techniques like conditional random fields (CRFs) with classical NLP tools like POS tagging and sentence segmentation to split the text into CC lines. The machine learning models are trained on existing, manually edited, high-quality CCs, like those of TED/TEDx talks on YouTube. The NLP tools come from the well-established, free NLP libraries for Python listed below.
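As an illustration of how POS tags and existing CCs can feed a CRF, the sketch below builds per-token feature dictionaries and derives training labels from manually edited CC lines. It is not the feature set used by lachesis: the function names, the feature names, and the `CONT`/`BREAK` labels are all assumptions made for this example.

```python
def token_features(tagged_tokens, index):
    """Build a feature dictionary for the token at `index`.

    `tagged_tokens` is a list of (word, pos_tag) pairs for one sentence.
    """
    word, pos = tagged_tokens[index]
    features = {
        "word.lower": word.lower(),
        "pos": pos,
        "is_punctuation": not any(ch.isalnum() for ch in word),
        # characters consumed before this token, counting one space per token
        "chars_before": sum(len(w) + 1 for w, _ in tagged_tokens[:index]),
    }
    if index > 0:
        features["prev.pos"] = tagged_tokens[index - 1][1]
    else:
        features["BOS"] = True  # beginning of sentence
    if index < len(tagged_tokens) - 1:
        features["next.pos"] = tagged_tokens[index + 1][1]
    else:
        features["EOS"] = True  # end of sentence
    return features


def labels_from_cc_lines(cc_lines):
    """Label each token: 'BREAK' if it ends a CC line, 'CONT' otherwise.

    `cc_lines` is a list of CC lines, each a list of tokens.
    """
    labels = []
    for line in cc_lines:
        if not line:
            continue  # skip empty lines
        labels.extend(["CONT"] * (len(line) - 1) + ["BREAK"])
    return labels


if __name__ == "__main__":
    tagged = [("Hello", "UH"), (",", ","), ("World", "NNP")]
    print([token_features(tagged, i) for i in range(len(tagged))])
    print(labels_from_cc_lines([["Hello", ",", "World"], ["again"]]))
```

A sequence model trained on such feature/label pairs learns where human editors tend to break lines, for example after punctuation or before a preposition.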

In summary, lachesis contains the following major functions:

• add POS tags to a given text or closed caption file;

• segment a given text into sentences;

• segment a given text into closed captions (several algorithms are available);

• train and use machine learning models to segment raw text into CC lines.

## Installation

    pip install lachesis

### Installing dependencies

You might need additional packages, depending on how you plan to use lachesis:

• python-crfsuite >= 0.9.1 for training and using CRF-based splitters.

By design, none of the above dependencies is installed by `pip install lachesis`. To install them all, use:

    pip install lachesis[full]

Alternatively, manually install only the dependencies you need. You can do so before or after installing lachesis; the order does not matter.

### Installing NLP Libraries

In addition to the dependencies listed above, to perform POS tagging and sentence segmentation lachesis can use one or more of the following libraries:

• Pattern (install with `pip install pattern`)

• NLTK (install with `pip install nltk`)

• spaCy (install with `pip install spacy`)

• UDPipe (install with `pip install ufal.udpipe`)

To install them all, use:

    pip install lachesis[nlp]

or use `pip install lachesis[fullnlp]` if you also want the `[full]` dependencies described above.

Each NLP library also needs language models which you need to download/install separately. Consult the documentation of your NLP library for details.
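Before running an analysis, it can help to check which of these optional libraries are importable in your environment. The sketch below uses only the standard library; the helper names are hypothetical, and the import names (`pattern`, `nltk`, `spacy`, `ufal.udpipe`) are assumed from the pip package names above. It only tells you whether a library is installed, not whether its language models are present.

```python
import importlib.util

# import names assumed from the pip package names listed in this README
NLP_MODULES = {
    "pattern": "pattern",
    "nltk": "nltk",
    "spacy": "spacy",
    "udpipe": "ufal.udpipe",
}


def is_importable(module_name):
    """Return True if `module_name` can be found, without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # raised e.g. when the parent package of a dotted name is missing
        return False


def available_nlp_libraries():
    """Map each library name to True/False, depending on whether it is installed."""
    return {name: is_importable(module) for name, module in NLP_MODULES.items()}


if __name__ == "__main__":
    for name, found in sorted(available_nlp_libraries().items()):
        print("%-8s %s" % (name, "installed" if found else "missing"))
```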

lachesis expects the following directories in your home directory (you can symlink them, if you installed each NLP library in a different place):

• `~/lachesis_data/nltk_data` for NLTK;

• `~/lachesis_data/spacy_data` for spaCy;

• `~/lachesis_data/udpipe_data` for UDPipe.
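A quick way to verify this layout is sketched below. The function `missing_data_dirs` is a hypothetical helper, not part of lachesis; it reports which of the expected directories are absent. Symlinks pointing to existing directories count as present.

```python
from pathlib import Path

# directory names taken from the list above
EXPECTED_SUBDIRS = ("nltk_data", "spacy_data", "udpipe_data")


def missing_data_dirs(home=None):
    """Return the expected subdirectories of ~/lachesis_data that do not exist.

    `home` overrides the home directory, which is convenient for testing.
    """
    base = Path(home) if home is not None else Path.home()
    data_root = base / "lachesis_data"
    # is_dir() follows symlinks, so symlinked directories are accepted
    return [name for name in EXPECTED_SUBDIRS if not (data_root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing under ~/lachesis_data: " + ", ".join(missing))
    else:
        print("All expected data directories are present.")
```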

Pattern does not need a separate download of its language models: they are bundled with the package installed by `pip install pattern`.

The following table summarizes the languages supported by each library in their standard language models pack. (Additional languages might be supported by third party projects/downloads or added over time.)

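Table columns (libraries): Pattern, NLTK, spaCy, UDPipe.

Table rows (languages): Arabic, Basque, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Gothic, Greek, Greek (ancient), Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Latin, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Romanian, Slovenian, Spanish, Swedish, Tamil, Turkish.

The per-library support marks of the table are not available here; consult the documentation of each library to check whether it supports a given language.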

## Usage

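### Download closed captions or read an existing TTML file

    from lachesis.downloaders import Downloader
    from lachesis.language import Language

    # set URL of the video and language of the CCs
    # (placeholder URL: replace VIDEO_ID with the ID of your video)
    url = u"https://www.youtube.com/watch?v=VIDEO_ID"
    language = Language.ENGLISH

    # NOTE: the Downloader method names and arguments below are assumptions;
    # check them against the lachesis.downloaders module before relying on them

    # automatic CCs, not saved to file
    options = {"auto": True}
    doc = Downloader.download_closed_captions(url, language, options)
    print(doc)

    # human-made CCs, saved to file
    options = {"auto": False, "output_file_path": "/tmp/ccs.ttml"}
    doc = Downloader.download_closed_captions(url, language, options)
    print(doc)

    # read an existing TTML file
    ifp = "/tmp/ccs.ttml"
    doc = Downloader.read_closed_captions(ifp, options={u"downloader": u"youtube"})
    print(doc.language)

    # print several representations of the CCs
    print(doc.raw_string)                       # multi line string, similar to SRT but w/o ids or times
    print(doc.raw_flat_clean_string)            # single line string, w/o CC line marks
    print(doc.raw.string(flat=True, eol=u"|"))  # single line string, CC lines separated by '|' characters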

### Tokenize, split sentences, and POS tagging

    from lachesis.elements import Document
    from lachesis.language import Language
    from lachesis.nlpwrappers import NLPEngine

    # work on this Unicode string
    s = u"Hello, World. This is a second sentence, with a comma too! And a third sentence."

    # but you can also pass a list with pre-split sentences
    # s = [u"Hello World.", u"This is a second sentence.", u"Third one, bla bla"]

    # create a Document object from the Unicode string
    doc = Document(raw=s, language=Language.ENGLISH)

    # tokenize, split sentences, and POS tagging
    # the best available NLP library will be chosen
    nlp1 = NLPEngine()
    nlp1.analyze(doc)

    # the text has been divided into tokens, grouped in sentences
    for sentence in doc.sentences:
        print(sentence)                                        # raw
        print(sentence.string(tagged=True))                    # tagged
        print(sentence.string(raw=True, eol=u"|", eos=u""))    # raw w/o CC line and sentence marks

    # explicitly specify the NLP library NLTK,
    # other options include: "pattern", "spacy", "udpipe"
    nlp2 = NLPEngine()
    nlp2.analyze(doc, wrapper=u"nltk")
    ...

    # if you need to analyze many documents,
    # preload (and keep in cache) an NLP library,
    # even different ones for different languages
    # NOTE: the keyword name `preload` is an assumption
    nlp3 = NLPEngine(preload=[
        (u"en", u"spacy"),
        (u"de", u"nltk"),
        (u"it", u"pattern"),
        (u"fr", u"udpipe")
    ])
    nlp3.analyze(doc)
    ...

### Split into closed captions

    from lachesis.elements import Document
    from lachesis.language import Language
    from lachesis.nlpwrappers import NLPEngine
    from lachesis.splitters import CRFSplitter
    from lachesis.splitters import GreedySplitter

    # create a document from a raw string
    s = u"Hello, World. This is a second sentence, with a comma too! And a third sentence."
    doc = Document(raw=s, language=Language.ENGLISH)

    # analyze it using the NLP library Pattern
    nlpe = NLPEngine()
    nlpe.analyze(doc, wrapper=u"pattern")

    # feed the document into the CRF splitter (max 42 chars/line, max 2 lines/cc)
    spl = CRFSplitter(doc.language, 42, 2)
    spl.split(doc)

    # print the segmented CCs, with an empty line after each CC
    for cc in doc.ccs:
        for line in cc.elements:
            print(line)
        print(u"")

    # the default location for CRF model files is ~/lachesis_data/crf_data/
    # but you can also specify a different path
    spl = CRFSplitter(doc.language, 42, 2, model_file_path="/tmp/yourmodel.crfsuite")
    spl.split(doc)

    # if you do not have pycrfsuite installed
    # or the CRF model file for the document language,
    # you can use the GreedySplitter
    gs = GreedySplitter(doc.language, 42, 2)
    gs.split(doc)
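To give an intuition of what a greedy strategy does, the following standalone sketch fills each line with as many words as fit within the character limit, then groups consecutive lines into CCs. It is not the code of `GreedySplitter`: the function `greedy_split` is hypothetical, and it ignores sentence boundaries and POS tags, which a real splitter may take into account.

```python
def greedy_split(text, max_chars_per_line=42, max_lines_per_cc=2):
    """Split `text` into CCs, each a list of at most `max_lines_per_cc` lines.

    Words are appended to the current line while the line stays within
    `max_chars_per_line`. A single word longer than the limit is kept
    on a line of its own instead of being broken.
    """
    lines = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if not current or len(candidate) <= max_chars_per_line:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    # group consecutive lines into CCs
    return [
        lines[i:i + max_lines_per_cc]
        for i in range(0, len(lines), max_lines_per_cc)
    ]


if __name__ == "__main__":
    s = u"Hello, World. This is a second sentence, with a comma too! And a third sentence."
    for cc in greedy_split(s, 42, 2):
        for line in cc:
            print(line)
        print(u"")
```

A purely greedy approach always satisfies the length constraints when no word exceeds the limit, but it can break lines at linguistically awkward points, which is where a trained CRF model can do better.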

### Train a CRF model to segment raw text into CC lines

    $ # /tmp/ccs/train contains several TTML files to learn from
    $ # you can download them from YouTube using lachesis (see above)
    $ ls /tmp/ccs/train
    0001.ttml
    0002.ttml
    ...
    $ # extract features and labels from them:
    $ python -m lachesis.ml.crf dump eng /tmp/ccs/train/ /tmp/ccs/train.pickle
    ...
    $ # train the CRF model:
    $ python -m lachesis.ml.crf train eng /tmp/ccs/train.pickle /tmp/ccs/model.crfsuite
    ...
    $ # evaluate the model on the training set
    $ python -m lachesis.ml.crf test eng /tmp/ccs/train.pickle /tmp/ccs/model.crfsuite
    ...
    $ # you might want to evaluate on a test set, disjoint from the training set,
    $ # that is, the test set contains CCs not seen during the training:
    $ ls /tmp/ccs/test
    1001.ttml
    1002.ttml
    ...
    $ python -m lachesis.ml.crf dump eng /tmp/ccs/test/ /tmp/ccs/test.pickle
    $ python -m lachesis.ml.crf test eng /tmp/ccs/test.pickle /tmp/ccs/model.crfsuite
    ...
    $ # now you can build a CRFSplitter
    $ # with model_file_path="/tmp/ccs/model.crfsuite" as shown above
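If all your TTML files sit in a single directory, one possible way to obtain disjoint training and test sets is sketched below. The helper `split_ttml_files`, the 20% test fraction, and the source directory `/tmp/ccs/all` are assumptions made for this example; lachesis itself only needs the two resulting directories.

```python
import random
import shutil
from pathlib import Path


def split_ttml_files(source_dir, train_dir, test_dir, test_fraction=0.2, seed=0):
    """Copy the *.ttml files in `source_dir` into disjoint train/test directories.

    Returns a pair (number of training files, number of test files).
    """
    files = sorted(Path(source_dir).glob("*.ttml"))
    # a fixed seed makes the split reproducible
    random.Random(seed).shuffle(files)
    n_test = int(round(len(files) * test_fraction))
    test_files, train_files = files[:n_test], files[n_test:]
    for target, group in ((Path(train_dir), train_files), (Path(test_dir), test_files)):
        target.mkdir(parents=True, exist_ok=True)
        for path in group:
            shutil.copy2(str(path), str(target / path.name))
    return len(train_files), len(test_files)


if __name__ == "__main__":
    # /tmp/ccs/all is a hypothetical directory holding all the TTML files
    if Path("/tmp/ccs/all").is_dir():
        print(split_ttml_files("/tmp/ccs/all", "/tmp/ccs/train", "/tmp/ccs/test"))
```

Because each file ends up in exactly one of the two directories, the score measured on the test set reflects CCs the model has not seen during training.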

## License

lachesis is released under the terms of the GNU Affero General Public License Version 3. See the LICENSE file for details.
