Utility library for data analysis.
DataCake
Table of Contents
Introduction
Data
Ingredients
Algorithms
Cake
An Introduction
Features Checklist
- [ ] 1. Flattens Deeply Nested Data
  "How flat are we talking? It'll make your data flatter than a pancake!"
- [ ] 2. Without Unnecessarily Duplicating Data
  "So, I get the whole cake and nothing but the cake? No more and no less!"
- [ ] 3. With No Loss of Information
  "You can have your cake and eat it too? Every bit of it!"
- [ ] 4. Using MongoDB-Style Syntax
  "Is it a piece of cake? You can bet your buns!"
- [ ] 5. Integrated with Numpy and Numba
  "Is that a cherry on top? Why yes it is!"
The Data
To illustrate DataCake's features, I'll use a small sample from the [SQuAD][1] dataset.
A Sample
#####
# Simplified Sample of the SQuAD Dataset
# - Don't worry about analyzing this too much
# - We will break it down step-by-step
#####
data: dict = {
    "qas": [{
        "question": "In what country is Normandy located?",
        "answers": [{"text": "France", "answer_start": 159}],
        "is_impossible": False
    }, {
        "question": "When were the Normans in Normandy?",
        "answers": [
            {"text": "10th and 11th centuries", "answer_start": 94},
            {"text": "in the 10th and 11th centuries", "answer_start": 87}
        ],
        "is_impossible": False
    }, {
        "question": "From which countries did the Norse originate?",
        "answers": [{"text": "Denmark, Iceland and Norway", "answer_start": 256}],
        "is_impossible": False
    }, {
        "question": "Who was the Norse leader?",
        "answers": [{"text": "Rollo", "answer_start": 308}],
        "is_impossible": False
    }, {
        "question": "What century did the Normans first gain their separate identity?",
        "answers": [
            {"text": "10th century", "answer_start": 671},
            {"text": "the first half of the 10th century", "answer_start": 649}
        ],
        "is_impossible": False
    }, {
        "plausible_answers": [{"text": "Normans", "answer_start": 4}],
        "question": "Who gave their name to Normandy in the 1000's and 1100's?",
        "answers": [],
        "is_impossible": True
    }, {
        "plausible_answers": [{"text": "Normandy", "answer_start": 137}],
        "question": "What is France a region of?",
        "answers": [],
        "is_impossible": True
    }],
    # Adjacent string literals are joined into one string.
    "context": (
        "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. "
        "They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. "
        "Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. "
        "The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."
    )
}
Some Questions
#####
# List of Sample Questions and If Answering is Impossible
# - For our needs, impossible questions are not desirable
#####
questions = [{
    "question": "In what country is Normandy located?",
    "is_impossible": False
}, {
    "question": "When were the Normans in Normandy?",
    "is_impossible": False
}, {
    "question": "From which countries did the Norse originate?",
    "is_impossible": False
}, {
    "question": "Who was the Norse leader?",
    "is_impossible": False
}, {
    "question": "What century did the Normans first gain their separate identity?",
    "is_impossible": False
}, {
    "question": "Who gave their name to Normandy in the 1000's and 1100's?",
    "is_impossible": True
}, {
    "question": "What is France a region of?",
    "is_impossible": True
}]
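Since impossible questions are not wanted, one simple way to drop them is a list comprehension. This is only a sketch in plain Python on a shortened copy of the list above; it is not DataCake's own API, and the name `possible` is chosen here for illustration.

```python
# Shortened copy of the `questions` list above.
questions = [
    {"question": "In what country is Normandy located?", "is_impossible": False},
    {"question": "Who was the Norse leader?", "is_impossible": False},
    {"question": "What is France a region of?", "is_impossible": True},
]

# Keep only the questions that can be answered from the context.
possible = [q["question"] for q in questions if not q["is_impossible"]]

for question in possible:
    print(question)
```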
Some Answers
##### - Answers
# Here we have lists of answers from the possible questions.
# Note that some questions have multiple correct answers.
# Each answer also has its beginning index found in the context.
# We want each of these answers, but we can get rid of the indexes.
# Each answer needs to be associated to its appropriate question.
# This needs to be done without any unnecessary duplication.
answers = [
    # In what country is Normandy located?
    [{"text": "France", "answer_start": 159}],
    # When were the Normans in Normandy?
    [{"text": "10th and 11th centuries", "answer_start": 94},
     {"text": "in the 10th and 11th centuries", "answer_start": 87}],
    # From which countries did the Norse originate?
    [{"text": "Denmark, Iceland and Norway", "answer_start": 256}],
    # Who was the Norse leader?
    [{"text": "Rollo", "answer_start": 308}],
    # What century did the Normans first gain their separate identity?
    [{"text": "10th century", "answer_start": 671},
     {"text": "the first half of the 10th century", "answer_start": 649}]
]

##### - Plausible Answers
# These are the plausible answers given with the impossible questions.
# They do not adequately answer their questions.
# We only want good answers extracted from the context.
plausible = [
    # Who gave their name to Normandy in the 1000's and 1100's?
    [{"text": "Normans", "answer_start": 4}],
    # What is France a region of?
    [{"text": "Normandy", "answer_start": 137}]
]
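The goals in the comments above (keep every answer, drop the indexes, tie each answer to its question, avoid duplication) can be sketched in plain Python. This is an illustrative assumption about one possible shape for the result, not DataCake's method; `answers_by_question` is a name made up for this example, and the inputs are a shortened copy of the lists above.

```python
# Shortened copies of the possible questions and their answer groups.
questions = [
    "In what country is Normandy located?",
    "When were the Normans in Normandy?",
]
answers = [
    [{"text": "France", "answer_start": 159}],
    [
        {"text": "10th and 11th centuries", "answer_start": 94},
        {"text": "in the 10th and 11th centuries", "answer_start": 87},
    ],
]

# Pair each question with its answer texts by position. The question is
# stored once as a key, and `answer_start` is left out of the result.
answers_by_question = {
    question: [answer["text"] for answer in group]
    for question, group in zip(questions, answers)
}

for question, texts in answers_by_question.items():
    print(question, "->", texts)
```

Keeping the question as a dictionary key means a question with several answers is stored once instead of once per answer.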
The Context
##### - The Context
# This is where all of the questions and answers are derived from.
# Each record of data will need to access it.
# We want to do this without any unnecessary duplication.
# Adjacent string literals are joined into one string.
context = (
    "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. "
    "They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. "
    "Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. "
    "The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."
)
The Ingredients
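Ingredients are the special keys derived from the data. Each data value is identified by a tuple of (index, feature) pairs: the integers are indexes taken from the data, and the strings are its feature names.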
Flattening
##### - Ingredients After Flattening
# This is a sample of the special keys from the sample data.
# Ingredients begin to take shape after the first flattening step.
# The integer values represent indexes as they are derived from the data.
# The string values represent the features derived from the data.
# Each data value has a tuple of these ingredient keys.
# The keys for the first few values are shown as comments below each value.
data = {
    "data": [{
        "qas": [{
            "question": "In what country is Normandy located?",
            # ((0, "data"), (0, "qas"), (0, "question"))
            "answers": [{
                "text": "France",
                # ((0, "data"), (0, "qas"), (0, "answers"), (0, "text"))
                "answer_start": 159
                # ((0, "data"), (0, "qas"), (0, "answers"), (0, "answer_start"))
            }],
            "is_impossible": False
            # ((0, "data"), (0, "qas"), (0, "is_impossible"))
        }, {
            "question": "When were the Normans in Normandy?",
            "answers": [
                {"text": "10th and 11th centuries", "answer_start": 94},
                {"text": "in the 10th and 11th centuries", "answer_start": 87}
            ],
            "is_impossible": False
        }, {
            "question": "From which countries did the Norse originate?",
            "answers": [{"text": "Denmark, Iceland and Norway", "answer_start": 256}],
            "is_impossible": False
        }, {
            "question": "Who was the Norse leader?",
            "answers": [{"text": "Rollo", "answer_start": 308}],
            "is_impossible": False
        }, {
            "question": "What century did the Normans first gain their separate identity?",
            "answers": [
                {"text": "10th century", "answer_start": 671},
                {"text": "the first half of the 10th century", "answer_start": 649}
            ],
            "is_impossible": False
        }, {
            "plausible_answers": [{"text": "Normans", "answer_start": 4}],
            "question": "Who gave their name to Normandy in the 1000's and 1100's?",
            "answers": [],
            "is_impossible": True
        }, {
            "plausible_answers": [{"text": "Normandy", "answer_start": 137}],
            "question": "What is France a region of?",
            "answers": [],
            "is_impossible": True
        }],
        "context": (
            "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. "
            "They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. "
            "Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. "
            "The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."
        )
        # ((0, "data"), (-1, "context"))
    }]
}

# Each ingredient is a tuple of (index, feature) pairs.
ingredients: list[tuple[tuple[int, str], ...]] = [
    ((-1, "context"),),
    ((0, "paragraphs"), (0, "questions"), (0, "q")),
    ((0, "paragraphs"), (0, "questions"), (0, "a")),
    ((0, "paragraphs"), (0, "questions"), (1, "q")),
    ((0, "paragraphs"), (0, "questions"), (1, "a")),
    ((0, "paragraphs"), (1, "questions"), (0, "q")),
    ((0, "paragraphs"), (1, "questions"), (0, "a")),
    ((0, "paragraphs"), (1, "questions"), (1, "q")),
    ((0, "paragraphs"), (1, "questions"), (1, "a")),
    ((0, "paragraphs"), (2, "questions"), (0, "q")),
    ((0, "paragraphs"), (2, "questions"), (0, "a")),
    ((0, "paragraphs"), (2, "questions"), (1, "q")),
    ((0, "paragraphs"), (2, "questions"), (1, "a")),
]
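To make the key tuples above more concrete, here is a minimal sketch of how such (index, feature) paths could be produced with a recursive walk. It is an assumption about the convention, not DataCake's implementation: each pair holds the position of the enclosing dict within its parent list (0 when there is no parent list) and the key being read. The `flatten` function and the `Path` alias are hypothetical names. The sketch does not reproduce the `-1` index shown for `context`, and it does not give distinct paths to scalars stored directly in a list.

```python
from typing import Any, Iterator

# A path is a tuple of (index, key) pairs, as in the annotated sample above.
Path = tuple[tuple[int, str], ...]


def flatten(node: Any, path: Path = (), index: int = 0) -> Iterator[tuple[Path, Any]]:
    """Yield (path, value) for every leaf value in nested dicts and lists.

    `index` is the position of the current dict inside its parent list,
    or 0 when the dict is not an element of a list.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            # Record which dict (by position) the key was read from.
            yield from flatten(value, path + ((index, key),), 0)
    elif isinstance(node, list):
        for position, item in enumerate(node):
            # Lists add no pair of their own; they only set the next index.
            yield from flatten(item, path, position)
    else:
        yield path, node


# Cut-down version of the first question from the sample data.
sample = {
    "data": [{
        "qas": [{
            "question": "In what country is Normandy located?",
            "answers": [{"text": "France", "answer_start": 159}],
            "is_impossible": False,
        }],
    }]
}

flat = dict(flatten(sample))
for key_path, value in flat.items():
    print(key_path, "->", value)
```

Because every value keeps its full path, the nested structure can in principle be rebuilt from the flat mapping, which is the sense in which no information is lost.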
https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4216
- Motivation from NIH Preliminary Data Post-Submission Update
- Challenges
  - Skewness
  - Load Balancing Problem
  - Not Very Well Explored
  - Programming Mismatch
  - Information Loss
  - Partitioning
  - Data Distribution Problem
  - Data Duplication Problem
  - Skewness
- Solutions
  - Shredding
  - Flatten
  - Map
  - FlatMap
  - Filter
- New Approach: Index Bucketing
  - Overview
  - Trees
  - Branches
  - Leaves
  - Overview
- Setup
  - Nested Data to Tree
  - Branches know their leaves
  - Leaves know their branches
- Flattening Evaluations
  - Index Bucketing
  - Recursive Mapping
  - Pandas Explosion
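As a rough illustration of the data duplication problem listed in the outline above, the sketch below flattens one paragraph into rows with a flatMap-style step (map each question to a list of rows, then flatten one level). It uses only the standard library; the shortened context string and the `rows_for` helper are stand-ins made up for this example, not part of DataCake.

```python
from itertools import chain

# Tiny stand-in for one SQuAD paragraph; the context is shortened for brevity.
paragraph = {
    "context": "The Normans gave their name to Normandy, a region in France.",
    "qas": [
        {"question": "In what country is Normandy located?",
         "answers": [{"text": "France"}]},
        {"question": "When were the Normans in Normandy?",
         "answers": [{"text": "10th and 11th centuries"},
                     {"text": "in the 10th and 11th centuries"}]},
    ],
}


def rows_for(qa: dict, context: str) -> list[tuple[str, str, str]]:
    """Map one question to a list of (context, question, answer) rows."""
    return [(context, qa["question"], answer["text"]) for answer in qa["answers"]]


# flatMap: map each question to its rows, then flatten the lists one level.
rows = list(chain.from_iterable(
    rows_for(qa, paragraph["context"]) for qa in paragraph["qas"]
))

# Every row carries the context again, and the two-answer question
# appears in two rows.
context_copies = sum(1 for row in rows if row[0] == paragraph["context"])
print(len(rows), context_copies)  # 3 3
```

In a flat table like this, the context is written once per answer, which is the kind of repetition the "Without Unnecessarily Duplicating Data" feature is meant to avoid.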
File details
Details for the file datacake-1.0.2.tar.gz.
File metadata
- Download URL: datacake-1.0.2.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | e63b1c3a748f62908be1ba15e3458794d7649414abffc211d45fa218ef373ce2
MD5 | 7823e4777389ce6ffb1d7e6cecb7680f
BLAKE2b-256 | ddb6197ea5c9f021560e83bbf2dd0ebaf5cbfae607007ed727e2bb5f0891d8f8
File details
Details for the file datacake-1.0.2-py3-none-any.whl.
File metadata
- Download URL: datacake-1.0.2-py3-none-any.whl
- Upload date:
- Size: 7.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2aaa11b2d6ce7576e92ce50a4c42f583496668fea626f015ecd1704cf44e94db
MD5 | 3d63bf751b266738d26cb16c16b085e3
BLAKE2b-256 | a9f927ebe1253c2c45aa3a188581d384ed600fd0ab854fc74523bad54c00d19f