A set of Python modules for the Cornell Movie-Dialogs Corpus with Storm
Project description
Abstract
This package includes classes that extend the Storm ORM to represent the data in the Cornell Movie-Dialogs Corpus.
Install
pip install storm # if you have not installed it yet
pip install cornel-movie-dialogs-corpus-storm
Setup
Download the corpus and unzip it.
Generate the database and load the corpus data into it with generate-mdcorpus-database.py.
For example:
generate-mdcorpus-database.py --corpus-dir "cornell movie-dialogs corpus" corpus.db
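Once the database exists, it can be inspected without the ORM through Python's standard sqlite3 module. The following is only a sketch: the table name movie_titles_metadata and its title column come from the sqlite session shown later in this document, while the other column names (id, year, rating, votes) and the helper find_titles are assumptions made for illustration. An in-memory database stands in for corpus.db so the example runs on its own.

```python
import sqlite3


def find_titles(conn, title):
    """Return every row of movie_titles_metadata whose title matches exactly."""
    cursor = conn.execute(
        "SELECT * FROM movie_titles_metadata WHERE title = ?", (title,)
    )
    return cursor.fetchall()


# Stand-in for sqlite3.connect("corpus.db"). Column names other than
# "title" are assumed here and may differ from the generated schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie_titles_metadata "
    "(id INTEGER PRIMARY KEY, title TEXT, year INTEGER, rating REAL, votes INTEGER)"
)
conn.execute(
    "INSERT INTO movie_titles_metadata VALUES (?, ?, ?, ?, ?)",
    (114, u"l\u00e9on", 1994, 8.6, 204901),
)

print(find_titles(conn, u"l\u00e9on"))
```

Passing the title as a query parameter rather than pasting it into the SQL string leaves the quoting and encoding of non-ASCII titles to the sqlite3 module.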
Usage
from mdcorpus.orm import *
from mdcorpus.parser import *
...
Class List
MovieTitlesMetadata
Genre
MovieGenreLine
MovieCharactersMetadata
MovieConversation
MovieLine
RawScriptUrl
Corpus Problem
These are my notes on the problems I ran into in the corpus data.
movie_titles_metadata.txt
I ignored any letter suffix that follows the year.
For example, line 34: 1989/I
I ignored duplicate entries in the genre data.
line 58, ['horror', 'mystery', 'mystery', 'sci-fi', 'sci-fi']
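The two clean-up rules above can be sketched in a few lines of Python. This is not the code in mdcorpus.parser; the helper names clean_year and unique_genres are hypothetical, and the sketch assumes the genre field is written as a Python-style list literal, as in the line 58 example.

```python
import ast
import re


def clean_year(raw):
    """Return the leading four-digit year as an int, dropping a suffix such as '/I'."""
    match = re.match(r"\d{4}", raw.strip())
    if match is None:
        raise ValueError("no year found in %r" % raw)
    return int(match.group(0))


def unique_genres(raw):
    """Parse a genre list literal and drop repeats, keeping first-seen order."""
    seen = set()
    result = []
    for genre in ast.literal_eval(raw):
        if genre not in seen:
            seen.add(genre)
            result.append(genre)
    return result


print(clean_year("1989/I"))  # 1989
print(unique_genres("['horror', 'mystery', 'mystery', 'sci-fi', 'sci-fi']"))
# ['horror', 'mystery', 'sci-fi']
```

ast.literal_eval is used instead of eval so that only literals are accepted from the corpus file.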
Code Problem
I use Python 2.7 and I don't know how to use the codecs module. (Unicode HOWTO, Python 2.7ja1 documentation)
mime
I converted the text encoding of the corpus files to UTF-8 with Mi.
before
cornell movie-dialogs corpus$ file --mime {(ls)}
README.txt: text/plain; charset=iso-8859-1
chameleons.pdf: application/pdf; charset=binary
movie_characters_metadata.txt: text/plain; charset=iso-8859-1
movie_conversations.txt: text/plain; charset=us-ascii
movie_lines.txt: text/plain; charset=us-ascii
movie_titles_metadata.txt: text/plain; charset=iso-8859-1
raw_script_urls.txt: text/plain; charset=iso-8859-1
after
cornell movie-dialogs corpus$ file --mime {(ls)}
README.txt: text/plain; charset=utf-8
chameleons.pdf: application/pdf; charset=binary
movie_characters_metadata.txt: text/plain; charset=utf-8
movie_conversations.txt: text/plain; charset=us-ascii
movie_lines.txt: text/plain; charset=us-ascii
movie_titles_metadata.txt: text/plain; charset=utf-8
raw_script_urls.txt: text/plain; charset=utf-8
movie_titles_metadata.txt
line 115, léon
movie_characters_metadata.txt
line 1727 - 1736, léon
result
sqlite> select * from movie_titles_metadata where title = 'léon';
sqlite> select * from movie_titles_metadata where title = 'l駮n';
114|l駮n|1994|8.6|204901
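The encoding conversion described in this section was done in a text editor. As an alternative, the same step can be scripted with io.open, which takes an explicit encoding. This is a minimal sketch, assuming the source files are ISO-8859-1 as reported by file --mime; the helper name convert_to_utf8 is hypothetical, and the demonstration runs on a temporary file rather than the real corpus.

```python
import io
import os
import tempfile


def convert_to_utf8(path, source_encoding="iso-8859-1"):
    """Rewrite the file at `path` in place as UTF-8."""
    # newline="" keeps the original line endings untouched.
    with io.open(path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with io.open(path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demonstration on a temporary file instead of a real corpus file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "wb") as f:
    f.write(b"l\xe9on\n")  # "léon" encoded as ISO-8859-1

convert_to_utf8(demo_path)

with io.open(demo_path, "rb") as f:
    converted = f.read()
os.remove(demo_path)

print(converted == u"l\u00e9on\n".encode("utf-8"))  # True
```

Running the conversion twice on the same file would corrupt it, because the second pass would read UTF-8 bytes as ISO-8859-1, so it is worth checking the charset with file --mime before converting.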
Source Distribution
Hashes for cornel-movie-dialogs-corpus-storm-0.1.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 0a365eca49ca32faf294073c9258525b6e441a4558f8af8d7ac7844ce0d01d59
MD5 | 5350f41b48e9eaa7859d23b026f8f5ce
BLAKE2b-256 | f92e573329b79419ee5c8feb82e8e7f3aa8132c978c847aa9f5ca0b75bbdb36b