ContentAI Metadata Flattening Service
Project description
metadata-flatten-extractor
A method to flatten generated JSON data into timed CSV events in support of analytic workflows within the ContentAI Platform, published as the extractor dsai_metadata_flatten. A PyPI package is also published for easy incorporation into other projects.
Getting Started
NOTE: Not all flattening functions will respect/obey properties defined here.
force_overwrite - (bool) - force existing files to be overwritten (default=False)
compressed - (bool) - compress output CSVs instead of raw write (default=True, e.g. append ‘.gz’)
all_frames - (bool) - for video-based events, log all instances in box or just the center (default=False)
time_offset - (int) - offset in seconds applied when merging events for an asset split into multiple parts (default=0); negative numbers truncate (skip) events that occur before the zero time mark (added v0.7.1)
time_offset_source - (str) - path to a one-line file containing the offset in seconds, applied according to the time_offset rules (added v1.4.0)
verbose - (bool) - verbose input/output configuration printing (default=False)
extractor - (string) - specify one extractor to flatten, skipping nested module import (default=all, e.g. dsai_metadata)
generator - (string) - specify one generator for output, skipping nested module import (*=all, empty=none; e.g. flattened_csv)
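The parameters above can be collected into a single JSON object, which is the form the EXTRACTOR_METADATA variable takes later in this document. The following is a minimal sketch only: the RuntimeConfig class and its to_metadata_json helper are hypothetical names that are not part of the package, and the defaults simply mirror the list above. Only values that differ from the defaults are emitted, so nothing has to be assumed about how "all" is encoded.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RuntimeConfig:
    """Hypothetical container for the runtime parameters listed above."""
    force_overwrite: bool = False
    compressed: bool = True
    all_frames: bool = False
    time_offset: int = 0
    time_offset_source: Optional[str] = None  # no default is documented; None means "unset" here
    verbose: bool = False
    extractor: Optional[str] = None  # None means "unset" here (documented default is all)
    generator: Optional[str] = None  # None means "unset" here; "" requests no generators

    def to_metadata_json(self) -> str:
        """Serialize only the values that differ from the defaults."""
        overrides = {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) != f.default
        }
        return json.dumps(overrides, sort_keys=True)


config = RuntimeConfig(force_overwrite=True, time_offset=10800, extractor="dsai_metadata")
print(config.to_metadata_json())
# {"extractor": "dsai_metadata", "force_overwrite": true, "time_offset": 10800}
```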
Generators
CSV Schema (CSV)
One output of this flattening is a set of CSV files, produced when flattened_csv is enabled as a generator. One file is created for each discovered/input parser/extractor. The standard schema for these CSV files has the following fields.
time_begin = time in seconds of event start
time_end = time in seconds of event end (may be equal to time_begin if instantaneous)
time_event = exact time in seconds (may be equal to time_begin if instantaneous)
source_event = source media for event to add granularity for event impact (e.g. face, video, audio, speech, image, ocr, script)
tag = simple text word or phrase
tag_type = descriptor for type of tag; e.g. tag=concept/label/emotion, keyword=special word, shot=segment, transcript=text, moderation=moderation, word=text/speech word, person=face or skeleton, phrase=long utterance, face=face emotion/properties, identity=face or speaker recognition, scene=semantic scenes/commercials/commercial_lead, brand=product or logo mention, emotion=visual or audio sentiment/emotion
score = confidence/probability
details = possible bounding box or other long-form (JSON-encoded) details
extractor = name of extractor for insight
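As an illustration of the schema above, the sketch below writes two events to a gzip-compressed CSV in memory and reads them back with a score filter, using only the standard library. It assumes the files carry a header row with these field names in this order, which is not confirmed here; the second row is made-up sample data, and write_events / read_events are hypothetical helpers rather than package functions.

```python
import csv
import gzip
import io

FIELDS = ["time_begin", "time_end", "time_event", "source_event",
          "tag", "tag_type", "score", "details", "extractor"]

# Illustrative rows; the second one is invented for this sketch.
rows = [
    {"time_begin": 0, "time_end": 1, "time_event": 0, "source_event": "audio",
     "tag": "Clock", "tag_type": "tag", "score": 0.08157,
     "details": '{"model": "/m/01x3z"}', "extractor": "example_extractor"},
    {"time_begin": 2.5, "time_end": 2.5, "time_event": 2.5, "source_event": "face",
     "tag": "Happy", "tag_type": "emotion", "score": 0.91,
     "details": "{}", "extractor": "example_extractor"},
]


def write_events(events):
    """Return gzip-compressed CSV bytes for a list of event dictionaries."""
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(events)
    return gzip.compress(text.getvalue().encode("utf-8"))


def read_events(blob, min_score=0.0):
    """Parse gzip-compressed CSV bytes, cast numeric fields, filter by score."""
    text = gzip.decompress(blob).decode("utf-8")
    events = []
    for row in csv.DictReader(io.StringIO(text, newline="")):
        for key in ("time_begin", "time_end", "time_event", "score"):
            row[key] = float(row[key])
        if row["score"] >= min_score:
            events.append(row)
    return events


blob = write_events(rows)
confident = read_events(blob, min_score=0.5)
print([(e["tag"], e["tag_type"]) for e in confident])  # [('Happy', 'emotion')]
```

The details column stays a JSON-encoded string after reading; call json.loads on it when the nested values are needed.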
Example Programmatic Use
While this library is primarily used as an extractor in ContentAI, it can be programmatically called within another extractor to simplify incoming data into a simple list for analysis. Several of these examples are available as code examples in the testing scripts.
This dictionary-based call example will parse output of the azure_videoindexer and return it as a dictionary only (do not generate CSV or JSON output).
from contentai_metadata_flatten.main import flatten
dict_result = flatten({"path_content": "content/jobs", "extractor": "azure_videoindexer",
    "generator": "", "verbose": True, "path_result": "."}, args=[])
This argument call example will parse all extractor outputs and generate a CSV.
from contentai_metadata_flatten.main import flatten
dict_result = flatten(args=["--path_content", "content/jobs/example.mp4",
    "--generator", "flattened_csv", "--path_result", "content/flattened"])
This low-level access to a parser allows more control over which file or directory is parsed by the library and no generators are called. This call example is the same as the first example except that it returns a DataFrame instead of a dictionary and may be slightly faster.
from contentai_metadata_flatten import parsers
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
list_parser = parsers.get_by_name("azure_videoindexer")
parser_instance = list_parser[0]['obj']("content/jobs", logger=logger)
config_default = parser_instance.default_config()
result_df = parser_instance.parse(config_default)
Parsers can also be retrieved by tag type through the same low-level access. This call example parses only certain tag types (below, only those of type identity and face).
from contentai_metadata_flatten import parsers
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
list_parser = parsers.get_by_type(["face", "identity"])
for parser_obj in list_parser:
    parser_instance = parser_obj['obj']("content/jobs", logger=logger)
    config_default = parser_instance.default_config()
    result_df = parser_instance.parse(config_default)
Return Value
The main function main.py::flatten now returns a richer dictionary (v1.3.0). For programmatic callers of the function, the dictionary object contains a data property (all of the flattened data as a list) and a generated property, which contains a list of nested dictionaries indicating generated output (if enabled). The example output below demonstrates the flattened results as well as two enabled generators.
{'data': [
{'tag': 'Clock', 'time_begin': 0, 'time_end': 1, 'time_event': 0, 'score': 0.08157, 'details': '{"model": "/m/01x3z"}', 'source_event': 'audio', 'tag_type': 'tag', 'extractor': 'example_extractor'},
{'tag': 'Sine wave', 'time_begin': 0, 'time_end': 1, 'time_event': 0, 'score': 0.07586, 'details': '{"model": "/m/01v_m0"}', 'source_event': 'audio', 'tag_type': 'tag', 'extractor': 'example_extractor'},
{'tag': 'Tick-tock', 'time_begin': 0, 'time_end': 1, 'time_event': 0, 'score': 0.07297, 'details': '{"model": "/m/07qjznl"}', 'source_event': 'audio', 'tag_type': 'tag', 'extractor': 'example_extractor'},
... ],
'generated': [
{'generator': 'flattened_csv', 'path': 'testme/example_extractor.csv.gz'},
{'generator': 'wbTimeTaggedMetadata', 'path': 'testme/wbTimeTaggedMetadata.json.gz'}]
}
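A caller will typically walk the data list and look up generated paths by generator name. The sketch below does this on an abbreviated copy of the example above; summarize is a hypothetical helper written for illustration, not a function of the package.

```python
from collections import defaultdict

# Abbreviated copy of the example return value shown above.
result = {
    "data": [
        {"tag": "Clock", "time_begin": 0, "time_end": 1, "time_event": 0,
         "score": 0.08157, "details": '{"model": "/m/01x3z"}',
         "source_event": "audio", "tag_type": "tag", "extractor": "example_extractor"},
        {"tag": "Sine wave", "time_begin": 0, "time_end": 1, "time_event": 0,
         "score": 0.07586, "details": '{"model": "/m/01v_m0"}',
         "source_event": "audio", "tag_type": "tag", "extractor": "example_extractor"},
        {"tag": "Tick-tock", "time_begin": 0, "time_end": 1, "time_event": 0,
         "score": 0.07297, "details": '{"model": "/m/07qjznl"}',
         "source_event": "audio", "tag_type": "tag", "extractor": "example_extractor"},
    ],
    "generated": [
        {"generator": "flattened_csv", "path": "testme/example_extractor.csv.gz"},
        {"generator": "wbTimeTaggedMetadata", "path": "testme/wbTimeTaggedMetadata.json.gz"},
    ],
}


def summarize(result, min_score=0.0):
    """Group tags by tag_type and map each generator name to its output path."""
    tags_by_type = defaultdict(list)
    for event in result.get("data", []):
        if event["score"] >= min_score:
            tags_by_type[event["tag_type"]].append(event["tag"])
    outputs = {item["generator"]: item["path"] for item in result.get("generated", [])}
    return dict(tags_by_type), outputs


tags, outputs = summarize(result, min_score=0.075)
print(tags)                      # {'tag': ['Clock', 'Sine wave']}
print(outputs["flattened_csv"])  # testme/example_extractor.csv.gz
```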
Execution and Deployment
This package is meant to be run as a one-off processing tool that aggregates the insights of other extractors.
command-line standalone
Run the code as if it is an extractor. In this mode, configure a few environment variables to let the code know where to look for content.
One can also run the command-line with a single argument as input and optionally add runtime configuration (see runtime variables) as part of the EXTRACTOR_METADATA variable as JSON.
EXTRACTOR_METADATA='{"compressed":true}'
Locally Run on Results
For utility, the above line has been wrapped in the bash script run_local.sh.
EXTRACTOR_METADATA="$3" EXTRACTOR_NAME=metadata-flatten EXTRACTOR_JOB_ID=1 \
EXTRACTOR_CONTENT_PATH=$1 EXTRACTOR_CONTENT_URL=file://$1 EXTRACTOR_RESULT_PATH=`pwd`/results/$2 \
python -u main.py
This allows a simplified command-line specification of a run configuration, which also allows the passage of metadata into a configuration.
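For callers who prefer to launch the run from Python, the same environment can be assembled as a dictionary. This is a sketch under assumptions: build_extractor_env is a hypothetical helper that mirrors the variables set by the script lines above, "run1" and "/tmp/work" are placeholder values, and the metadata is left as an empty string when none is given, as the shell script would do with a missing third argument.

```python
import json
import os


def build_extractor_env(content_path, result_name, metadata=None, base_dir=None):
    """Return the environment variables that the script above sets, as a dict."""
    base_dir = base_dir or os.getcwd()
    return {
        # JSON-encoded runtime configuration; empty string when not provided
        "EXTRACTOR_METADATA": json.dumps(metadata) if metadata is not None else "",
        "EXTRACTOR_NAME": "metadata-flatten",
        "EXTRACTOR_JOB_ID": "1",
        "EXTRACTOR_CONTENT_PATH": content_path,
        "EXTRACTOR_CONTENT_URL": "file://" + content_path,
        "EXTRACTOR_RESULT_PATH": os.path.join(base_dir, "results", result_name),
    }


env = build_extractor_env(
    "data/wHaT3ver1t1s", "run1",
    metadata={"force_overwrite": False, "time_offset": 10800},
    base_dir="/tmp/work",
)
print(env["EXTRACTOR_METADATA"])  # {"force_overwrite": false, "time_offset": 10800}

# To launch the run (not executed in this sketch):
#   import subprocess
#   subprocess.run(["python", "-u", "main.py"], env={**os.environ, **env}, check=True)
```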
Normal result generation into compressed CSVs (with overwrite).
./run_local.sh data/wHaT3ver1t1s results/
Result generation with environment variables and integration of results from a file that was split at an offset of three hours.
./run_local.sh results/1XMDAz9w8T1JFEKHRuNunQhRWL1/ results/ '{"force_overwrite":false,"time_offset":10800}'
Result generation from a single extractor, with its nested directory explicitly specified. (added v0.6.1)
./run_local.sh results/dsai_metadata results/ '{"extractor":"dsai_metadata"}'
Local Runs with Timing Offsets
The script run_local.sh also searches for a text file called timing.txt in each source directory. If found, it will offset all results by the specified number of seconds before saving them to disk. Also, negative numbers will cause a truncation (skip) of events happening before the zero time mark. (added v0.7.1)
This capability may be useful if you have to manually split a file into multiple smaller files at a pre-determined time offset (e.g. three hours -> 10800 in timing.txt). (added v0.5.2)
echo "10800" > 1XMDAz9w8T1JFEKHRuNunQhRWL1/timing.txt
./run_local.sh results/1XMDAz9w8T1JFEKHRuNunQhRWL1/ results/
Afterwards, new results can be added arbitrarily and the script can be rerun in the same directory to accommodate different timing offsets.
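The offset behavior described in this section can be sketched in a few lines. This is not the package's implementation: read_offset and apply_offset are hypothetical helpers, and the rule that an event is skipped when its shifted time_begin falls below zero is an assumption about how the truncation is decided.

```python
import tempfile
from pathlib import Path


def read_offset(source_dir, filename="timing.txt"):
    """Return the offset in seconds from a one-line file, or 0 if it is absent."""
    path = Path(source_dir) / filename
    if not path.is_file():
        return 0
    return int(path.read_text().strip())


def apply_offset(events, offset):
    """Shift event times by `offset` seconds.

    Assumption: an event is dropped when its shifted time_begin is negative.
    """
    shifted = []
    for event in events:
        if event["time_begin"] + offset < 0:
            continue  # event would start before the zero time mark
        moved = dict(event)
        for key in ("time_begin", "time_end", "time_event"):
            moved[key] = event[key] + offset
        shifted.append(moved)
    return shifted


# Made-up events for illustration.
events = [
    {"tag": "intro", "time_begin": 5, "time_end": 6, "time_event": 5},
    {"tag": "credits", "time_begin": 120, "time_end": 130, "time_event": 125},
]

with tempfile.TemporaryDirectory() as source_dir:
    (Path(source_dir) / "timing.txt").write_text("10800\n")
    offset = read_offset(source_dir)

print(apply_offset(events, offset)[0]["time_begin"])  # 10805
print([e["tag"] for e in apply_offset(events, -60)])  # ['credits']
```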
Example demonstrating integration of multiple output directories.
find results -type d -d 1 | xargs -I {} ./run_local.sh {} results/
ContentAI
Deployment
Deployment is easy and follows standard ContentAI steps.
contentai deploy --cpu 256 --memory 512 metadata-flatten
Deploying...
writing workflow.dot
done
Alternatively, you can pass an image name to avoid rebuilding the docker image.
docker build -t metadata-deploy .
contentai deploy metadata-flatten --cpu 256 --memory 512 -i metadata-deploy
Locally Downloading Results
You can download data from a specific job for this extractor to analyze it locally.
contentai data wHaT3ver1t1s --dir data
Run as an Extractor
contentai run https://bucket/video.mp4 -w 'digraph { aws_rekognition_video_celebs -> metadata_flatten}'
JOB ID: 1Tfb1vPPqTQ0lVD1JDPUilB8QNr
CONTENT: s3://bucket/video.mp4
STATE: complete
START: Fri Feb 15 04:38:05 PM (6 minutes ago)
UPDATED: 1 minute ago
END: Fri Feb 15 04:43:04 PM (1 minute ago)
DURATION: 4 minutes
EXTRACTORS
my_extractor
TASK STATE START DURATION
724a493 complete 5 minutes ago 1 minute
Or run it via the docker image…
docker run --rm -v `pwd`/:/x -e EXTRACTOR_CONTENT_PATH=/x/file.mp3 -e EXTRACTOR_RESULT_PATH=/x/result2 <docker_image>
View Extractor Logs (stdout)
contentai logs -f <my_extractor>
my_extractor Fri Nov 15 04:39:22 PM writing some data
Job complete in 4m58.265737799s
Testing
Testing is included via tox. To launch testing for the entire package, just run tox at the command line. Testing can also be run for a specific file within the package by setting the environment variable TOX_ARGS.
TOX_ARGS=test_basic.py tox
Future Development
the remaining known extractors… openpose, dsai_tmstext_classifier_extractor, dsai_vinyl_sound_ai, dsai_name_entity_extractor, aws_rekognition_video_segments
integration of viewership insights
creation of sentiment and mood-based insights (which tags most co-occur here?)
Changes
1.4
1.4.0
fix for timing offsets; don’t overwrite any output if timing offset indicator
1.4.1
add new dsai_ads_detector parser for predictive ad locations
1.3
1.3.3
minor fix for azure_videoindexer parsing; the first video shot may now lack a keyframe
1.3.2
minor fix for gcp_videointelligence_text_detection parsing
1.3.1
fix for no-output generators
fix complete output for returned dictionary of data
add richer documentation for library/api usage
1.3.0
update output of main parse function to return a dict instead of file listing
modify generator specification to allow ALL (* default) or NONE for outputs
1.2
1.2.2
add parsers for gcp_videointelligence_text_detection, comskip_json, ibm_max_audio_classifier, gcp_videointelligence_object_tracking, gcp_videointelligence_people_detection
improve testing to iterate over known set of data in testing dir
fix generator/parser retrieve for whole name matches, not partials
add documentation for new types, explicitly call out person tag_type
update the dsai_activity_emotions parser to return tag type emotion (matching that of other AWS, Azure parsers)
1.2.1
update azure_videoindexer for tag_type in detected brands (was speech, now video)
1.2.0
add unit-testing to package build
add command-line / parser input as complement to contentai-driven ENV variables
fix bugs around specification of result path or specific generator
1.1
1.1.8
fix issue about constant reference
fix run_local.sh script for extra run param config
fix querying for local files in non-contentai environments (regression since 1.1.0)
1.1.7
inclusion of other constants for compatibility with other packages
refactor/rename of parser classes to mandate a filename output prefix (e.g. flatten_)
add dsai_activity_emotions parser (a clone of dsai_activity_classifier)
1.1.6
remove applications, fork to new metadata-database source, to be posted as a pypi database package
1.1.4
name update for dsai_moderation_image extractor
1.1.3
hotfix for build distribution
fix for content creation in streamlit/browsing app
1.1.2
deployed extractor (docker fix) for updated namespace
1.1.1
docs update, testing fixes, version bump for publication
1.1.0
rename to contentai-metadata-flatten and publish to pypi as a package!
1.0
1.0.2
update documentation for Metadata Browser and Inventory Discovery app
1.0.1
add ability to parse input CSVs but not segment into shot
move to a single NLP library (spacy) for applications, using large model (with vectors)
1.0.0
add new dash/plotly driven quality check application
0.9
0.9.9
update to optimize the pull of asset keys
0.9.7
upgrade to use new contentai extractor package
update parser logic for safer key and data retrieval
0.9.6
small tweaks/normalization of rounding factor for extractors
correct emotion source type for azure
refactor app location for primary streamlit browser
fix mode discovery for modules with specific UX interface
update file listing to show data bundle files as well
refactor utilities script for reuse in other apps
0.9.5
update to parse new version of dsai_places
add new parser for detectron2 extractor
0.9.4
add static file serving to streamlit app, inspired by this streamlit issue discussion
modify some pages to point to downloadable tables (with button click)
create new download page/mode that lists the generated and source files
minor refactor of app’s docker image for better caching in local creation and testing
0.9.3
add dsai_moderation_text parser, update dsai_moderation parser for version robustness
add min threshold (0.05) to both moderation detectors
0.9.2
add recursion to file-based discovery method for processed assets
unify read of JSON and text files with internalized function call in extractor base class
fix some extractors to use single name reference self.EXTRACTOR
0.9.1
fix transcript parsing in azure_videoindexer component
add speaker differentiation as an identity block in azure_videoindexer (similar to aws_transcribe)
0.9.0
add timeline viewing to the event_table mode of streamlit app
0.8
0.8.9
fixes to main streamlit app for partial extractors (e.g. missing identity, sparse brand)
0.8.8
add parser for dsai_moderation
0.8.7
add parser for dsai_activity_classifier
fix bug for faulty rejection of flatten_aws_transcribe results
0.8.6
add parsers for pyscenedetect, dsai_sceneboundary, aws_transcribe, yolo3, aws_rekognition_video_text_detect
add speaker identity (from speech) to gcp_videointelligence_speech_transcription
add type field (maps to tag_type) to output generated by wbTimeTaggedMetadata generator
add hashing against data (e.g. box) within JSON metadata generator
0.8.5
add parsers for dsai_yt8m (youtube8M or mediapipe)
0.8.4
add parsers for dsai_activity_slowfast (activity) and dsai_places (scene/settings)
add source_type sub-field to event_table browsing mode
0.8.3
add manifest option to application for multiple assets
fix app docker file for placement/generation of code with a specific user ID
fix CI/CD integration for auto launch
fix app explorer bugs (derive ‘words’ from transcript/keywords if none)
0.8.2
hotfix for missing data in dsai_metadata parser
0.8.2
slight refactor of how parsers are discovered, to allow search by name or type (for use as package)
fix package import for contentai local file
switch tag_type of ocr to transcript and ocr for source_type (azure_videoindexer)
0.8.1
adding music parser dsai_musicnn for different audio regions
0.8.0
convert to package for other modules to install
switch document to RST from MD
add primitive testing capabilities (to be filled)
0.7
0.7.1
added truncation/trim of events before zero mark if time offset is negative
re-brand extractor as dsai_metadata_flatten for ownership consistency
0.7.0
create new set of generator class objects for varying output generator
add new generator input for limiting output to a single type
0.6
0.6.2
rename rekognition_face_collection to aws_rekognition_face_collection for consistency
0.6.1
split documentation and changes
add new cae_metadata type of parser
modify source_type of detected faces in azure_videoindexer to face
modify to add new extractor input for limit to scanning (skips sub-dir check)
0.6.0
adding CI/CD script for gitlab
validate usage as a flattening service
modify source_type for aws_rekognition_video_celebs to face
0.5
0.5.4
adding face_attributes visualization mode for exploration of face data
fix face processing to split out to tag_type as face with richer subtags
0.5.3
add labeling component to application (for video/image inspection)
fix shot duration computation in application (do not overwrite original event duration)
add text-search for scanning named entities, words from transcript
0.5.2
fix bugs in gcp_videointelligence_logo_recognition (timing) and aws_rekognition_video_faces (face emotions)
add new detection of timing.txt for integration of multiple results and their potential time offsets
added verbose flag to input of main parser
rename rekognition_face_collection for consistency with other parsers
0.5.1
split app modules into different visualization modes (overview, event_table, brand_expansion)
brand_expansion uses kNN search to expand from shots with brands to similar shots and returns those brands
event_table allows specific exploration of identity (e.g. celebrities) and brands with image/video playback
NOTE The new application requires scikit-learn to perform live indexing of features
dramatically improved frame targeting (time offset) for event instances (video) in application
0.5.0
split main function into separate auto-discovered modules
add new user collection detection parser rekognition_face_collection (custom face collections)
0.4
0.4.5
fixes for gcp moderation flattening
fixes for app rendering (switch most graphs to scatter plot)
make all charts interactive again
fix for time zone/browser challenge in rendering
0.4.4
fixes for azure_videoindexer parser
add sentiment and emotion summary
rework graph generation and add brand/entity search capability
0.4.3
add new azure_videoindexer parser
switch flattened reference from logo to brand; explicit to moderation
add parsing library pytimeparse for simpler ingest
fix bug to delete old data bundle if reference files are available
0.4.2
add new time_offset parameter to environment/run configuration
fix bug for reusing/rewriting existing files
add output prefix flatten_ to all generated CSVs to avoid collision with other extractor input
0.4.1
fix docker image for nlp tasks, fix stop word aggregation
0.4.0
adding video playback (and image preview) via inline command-line execution of ffmpeg in application
create new Dockerfile.app for all-in-one explorer app creation
0.3
0.3.2
argument input capabilities for exploration app
sort histograms in exploration app by count not alphabet
0.3.1
browsing bugfixes for exploration application
0.3.0
added new streamlit code for data explorer interface
be sure to install extra packages if using this app and starting from scratch (e.g. new flattened files)
if you’re working from a cached model, you can also drop it in from a friend
0.2
0.2.1
schema change for verb/action consistency time_start -> time_begin
add additional row field tag_type to describe type of tag (see generated-insights)
add processing type gcp_videointelligence_logo_recognition
allow compression as a requirement/input for generated files (compressed as input)
0.2.0
add initial package, requirements, docker image
add basic readme for usage example
processes types gcp_videointelligence_label, gcp_videointelligence_shot_change, gcp_videointelligence_explicit_content, gcp_videointelligence_speech_transcription, aws_rekognition_video_content_moderation, aws_rekognition_video_celebs, aws_rekognition_video_labels, aws_rekognition_video_faces, aws_rekognition_video_person_tracking
Hashes for contentai_metadata_flatten-1.4.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 60eb0094e290dc3da9bb6c50a220ba2efa77a73431f3588f7f1767942176f010
MD5 | 4e77bdb28c5ec4acf648f061f322f97e
BLAKE2b-256 | bb9b3a4f7a90772e3defa7c33fab568c94b66b9d4583aa1566a58638556794a1
Hashes for contentai_metadata_flatten-1.4.1-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7ffe9c3554d46c6c70f2d9a41c489193f9302b7db57d6d4f449c78af87ac9a5e
MD5 | 6dc0f8545dae1c03e46fa868d0422bf8
BLAKE2b-256 | 3c4cc8d375df89c8abd05804d4ae64febd85ee28f3b88ffec61b194eb3dbd115
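A downloaded file can be checked against the SHA256 values listed here with the standard library. The sketch below is illustrative only: sha256_of and matches are hypothetical helpers, and the demonstration runs on a throwaway file rather than on the actual distribution archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True when the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on a temporary file; for a real check, pass the path of the
# downloaded archive and the SHA256 value from the listing.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    ok = matches(sample, hashlib.sha256(b"hello").hexdigest())
    bad = matches(sample, "0" * 64)

print(ok, bad)  # True False
```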