A set of Python modules for the Cornell Movie-Dialogs Corpus with Storm
Project description
Abstract
This package includes classes that extend the Storm ORM to model the data in the Cornell Movie-Dialogs Corpus.
Install
pip install storm # if Storm is not installed yet
pip install cornel-movie-dialogs-corpus-storm
Setup
Download the corpus and unzip it.
Generate the database and load the corpus into it with generate-mdcorpus-database.py.
For example:
generate-mdcorpus-database.py --corpus-dir "cornell movie-dialogs corpus" corpus.db
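Once the command above has finished, a quick sanity check is to list the tables in the generated file. The following is a minimal sketch that uses only the standard-library sqlite3 module. The helper name list_tables is made up for illustration, and the sketch assumes corpus.db is an SQLite database, as the sqlite> session later in this document suggests.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path."""
    # list_tables is a hypothetical helper, not part of this package.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Example call (assumes corpus.db was generated as shown above):
# print(list_tables("corpus.db"))
```

An empty list would suggest that the database was created but nothing was loaded into it.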
Usage
from mdcorpus.orm import *
from mdcorpus.parser import *
...
Class List
MovieTitlesMetadata
Genre
MovieGenreLine
MovieCharactersMetadata
MovieConversation
MovieLine
RawScriptUrl
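The classes listed above can presumably be queried through an ordinary Storm store. The sketch below counts the rows mapped to MovieTitlesMetadata. It is an assumption-laden illustration, not this package's documented usage: the helper names sqlite_uri and count_movies are hypothetical, the database path is whatever was passed to generate-mdcorpus-database.py, and no column attributes are used because this document does not list them.

```python
def sqlite_uri(db_path):
    """Build the URI string that Storm expects for an SQLite file."""
    return "sqlite:" + db_path


def count_movies(db_path):
    """Count the rows mapped to MovieTitlesMetadata in the given database."""
    # Imported lazily so this module still loads where Storm is not installed.
    from storm.locals import Store, create_database
    from mdcorpus.orm import MovieTitlesMetadata

    store = Store(create_database(sqlite_uri(db_path)))
    try:
        # find() with no conditions selects every row of the mapped table.
        return store.find(MovieTitlesMetadata).count()
    finally:
        store.close()


# Example call (assumes corpus.db exists and Storm is installed):
# print(count_movies("corpus.db"))
```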
Corpus Problem
These are notes on the problems I ran into in the corpus data.
movie_titles_metadata.txt
I ignored the letter suffix that follows some years.
For example, line 34: 1989/I
I ignored duplicate genre entries.
For example, line 58: ['horror', 'mystery', 'mystery', 'sci-fi', 'sci-fi']
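The two workarounds above can be sketched as small helpers. This is only an illustration of the idea, not the parser shipped in mdcorpus.parser; the function names parse_year and unique_genres are hypothetical.

```python
import re


def parse_year(raw):
    """Return the leading four-digit year, ignoring a suffix such as '/I'."""
    match = re.match(r"\s*(\d{4})", raw)
    if match is None:
        raise ValueError("no year found in %r" % (raw,))
    return int(match.group(1))


def unique_genres(genres):
    """Drop repeated genres while keeping the original order."""
    seen = set()
    result = []
    for genre in genres:
        if genre not in seen:
            seen.add(genre)
            result.append(genre)
    return result


print(parse_year("1989/I"))
print(unique_genres(["horror", "mystery", "mystery", "sci-fi", "sci-fi"]))
```

Keeping the first occurrence of each genre preserves the order used in the corpus file, which a plain set() would not guarantee.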
Code Problem
I use Python 2.7, and I don't know how to use the codecs module properly. (See Unicode HOWTO, Python 2.7ja1 documentation.)
mime
Convert the text encoding to UTF-8 with Mi
before
cornell movie-dialogs corpus$ file --mime {(ls)}
README.txt: text/plain; charset=iso-8859-1
chameleons.pdf: application/pdf; charset=binary
movie_characters_metadata.txt: text/plain; charset=iso-8859-1
movie_conversations.txt: text/plain; charset=us-ascii
movie_lines.txt: text/plain; charset=us-ascii
movie_titles_metadata.txt: text/plain; charset=iso-8859-1
raw_script_urls.txt: text/plain; charset=iso-8859-1
after
cornell movie-dialogs corpus$ file --mime {(ls)}
README.txt: text/plain; charset=utf-8
chameleons.pdf: application/pdf; charset=binary
movie_characters_metadata.txt: text/plain; charset=utf-8
movie_conversations.txt: text/plain; charset=us-ascii
movie_lines.txt: text/plain; charset=us-ascii
movie_titles_metadata.txt: text/plain; charset=utf-8
raw_script_urls.txt: text/plain; charset=utf-8
movie_titles_metadata.txt
line 115, léon
movie_characters_metadata.txt
line 1727 - 1736, léon
result
sqlite> select * from movie_titles_metadata where title = 'léon';
sqlite> select * from movie_titles_metadata where title = 'l駮n';
114|l駮n|1994|8.6|204901
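As an alternative to converting the files in an editor, the same re-encoding can be sketched in Python with io.open, which takes an encoding argument on both Python 2.7 and Python 3. The function name convert_to_utf8 is hypothetical, and iso-8859-1 as the source encoding is an assumption taken from the file --mime output above.

```python
import io


def convert_to_utf8(src_path, dst_path, source_encoding="iso-8859-1"):
    """Re-encode a text file from source_encoding to UTF-8."""
    # newline="" keeps the original line endings untouched in both directions.
    with io.open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Example call (file names are illustrative):
# convert_to_utf8("movie_titles_metadata.txt", "movie_titles_metadata.utf8.txt")
```

Writing to a separate destination file keeps the original corpus file intact, so the conversion can be repeated if the assumed source encoding turns out to be wrong.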
Source Distribution
File details
Details for the file cornel-movie-dialogs-corpus-storm-0.1.1.tar.gz.
File metadata
- Download URL: cornel-movie-dialogs-corpus-storm-0.1.1.tar.gz
- Upload date:
- Size: 5.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0a365eca49ca32faf294073c9258525b6e441a4558f8af8d7ac7844ce0d01d59 |
| MD5 | 5350f41b48e9eaa7859d23b026f8f5ce |
| BLAKE2b-256 | f92e573329b79419ee5c8feb82e8e7f3aa8132c978c847aa9f5ca0b75bbdb36b |