
Utility library for data analysis.


DataCake


Table of Contents

Introduction

  1. Features
  2. Motivation

Data

  1. A Sample
  2. Some Questions
  3. Some Answers
  4. The Context

Ingredients

  1. Flattening
  2. Scattering
  3. Spattering

Algorithms

  1. Index Bucketing

Cake


An Introduction

Features Checklist


  • [ ] 1. Flattens Deeply Nested Data

"How flat are we talking? It'll make your data flatter than a pancake!"


  • [ ] 2. Without Unnecessarily Duplicating Data

"So, I get the whole cake and nothing but the cake? No more and no less!"


  • [ ] 3. With No Loss of Information

"You can have your cake and eat it too? Every bit of it!"


  • [ ] 4. Using MongoDB-Style Syntax

"Is it a piece of cake? You can bet your buns!"


  • [ ] 5. Integrated with Numpy and Numba

"Is that a cherry on top? Why yes it is!"


The Data

To illustrate DataCake's features, I'll use a small sample from the [SQuAD][1] dataset.


A Sample
#####
# Simplified Sample of the SQuAD Dataset
# - Don't worry about analyzing this too much
# - We will break it down step-by-step
#####
data: dict = {
  "qas": [{
    "question": "In what country is Normandy located?",
    "answers": [{
      "text": "France",
      "answer_start": 159
    }],
    "is_impossible": False
  }, {
    "question": "When were the Normans in Normandy?",
    "answers": [{
      "text": "10th and 11th centuries",
      "answer_start": 94
    }, {
      "text": "in the 10th and 11th centuries",
      "answer_start": 87
    }],
    "is_impossible": False
  }, {
    "question": "From which countries did the Norse originate?",
    "answers": [{
      "text": "Denmark, Iceland and Norway",
      "answer_start": 256
    }],
    "is_impossible": False
  }, {
    "question": "Who was the Norse leader?",
    "answers": [{
      "text": "Rollo",
      "answer_start": 308
    }],
    "is_impossible": False
  }, {
    "question": "What century did the Normans first gain their separate identity?",
    "answers": [{
      "text": "10th century",
      "answer_start": 671
    }, {
      "text": "the first half of the 10th century",
      "answer_start": 649
    }],
    "is_impossible": False
  }, {
    "plausible_answers": [{
      "text": "Normans",
      "answer_start": 4
    }],
    "question": "Who gave their name to Normandy in the 1000's and 1100's",
    "answers": [],
    "is_impossible": True
  }, {
    "plausible_answers": [{
      "text": "Normandy",
      "answer_start": 137
    }],
    "question": "What is France a region of?",
    "answers": [],
    "is_impossible": True
  }],
  "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."
}

Some Questions
#####
# List of Sample Questions and If Answering is Impossible
# - For our needs, impossible questions are not desirable
#####
questions = [{
  "question": "In what country is Normandy located?",
  "is_impossible": False
}, {
  "question": "When were the Normans in Normandy?",
  "is_impossible": False
}, {
  "question": "From which countries did the Norse originate?",
  "is_impossible": False
}, {
  "question": "Who was the Norse leader?",
  "is_impossible": False
}, {
  "question": "What century did the Normans first gain their separate identity?",
  "is_impossible": False
}, {
  "question": "Who gave their name to Normandy in the 1000's and 1100's?",
  "is_impossible": True
}, {
  "question": "What is France a region of?",
  "is_impossible": True
}]
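Since impossible questions are not wanted, one way to drop them is a simple filter over the `qas` records. This is a minimal sketch in plain Python, not DataCake's API; `possible_questions` is a hypothetical helper name, and the two records are a shortened copy of the sample above.

```python
# Shortened copy of the sample "qas" list; only the fields used here are kept.
qas = [
    {"question": "Who was the Norse leader?", "is_impossible": False},
    {"question": "What is France a region of?", "is_impossible": True},
]


def possible_questions(qas: list[dict]) -> list[str]:
    """Return the text of every question that is flagged as answerable."""
    return [qa["question"] for qa in qas if not qa["is_impossible"]]


kept = possible_questions(qas)
```

Here `kept` holds only the Norse leader question; the impossible question is discarded.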

Some Answers
##### - Answers
# Here we have lists of answers from the possible questions.
# Note that some questions have multiple correct answers.
# Each answer also has its beginning index found in the context.
# We want each of these answers, but we can get rid of the indexes.
# Each answer needs to be associated with its appropriate question.
# This needs to be done without any unnecessary duplication.

answers = [
  # In what country is Normandy located?
  [{
    "text": "France",
    "answer_start": 159
  }],
  # When were the Normans in Normandy?
  [{
    "text": "10th and 11th centuries",
    "answer_start": 94
  }, {
    "text": "in the 10th and 11th centuries",
    "answer_start": 87
  }],
  # From which countries did the Norse originate?
  [{
    "text": "Denmark, Iceland and Norway",
    "answer_start": 256
  }],
  # Who was the Norse leader?
  [{
    "text": "Rollo",
    "answer_start": 308
  }],
  # What century did the Normans first gain their separate identity?
  [{
    "text": "10th century",
    "answer_start": 671
  }, {
    "text": "the first half of the 10th century",
    "answer_start": 649
  }]
]
##### - Plausible Answers
# These are the plausible answers given with the impossible questions.
# They do not adequately answer their questions.
# We only want good answers extracted from the context.

plausible = [
  # Who gave their name to Normandy in the 1000's and 1100's?
  [{
    "text": "Normans",
    "answer_start": 4
  }],
  # What is France a region of?
  [{
    "text": "Normandy",
    "answer_start": 137
  }]
]
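The requirement that each answer stays tied to its question without duplication can be sketched by storing every question once and letting each answer carry only the index of its question. This is an illustration under that assumption, not DataCake's internal layout; `link_answers` is a hypothetical helper name.

```python
# Each question is stored exactly once.
questions = [
    "In what country is Normandy located?",
    "When were the Normans in Normandy?",
]

# One list of answers per question, in the same order as `questions`.
answers = [
    [{"text": "France", "answer_start": 159}],
    [{"text": "10th and 11th centuries", "answer_start": 94},
     {"text": "in the 10th and 11th centuries", "answer_start": 87}],
]


def link_answers(answers: list[list[dict]]) -> list[tuple[int, str]]:
    """Pair each answer text with the index of its question, dropping answer_start."""
    return [
        (question_index, answer["text"])
        for question_index, group in enumerate(answers)
        for answer in group
    ]


links = link_answers(answers)
```

Each entry in `links` is a `(question_index, text)` pair, so a question with two answers appears as two small pairs rather than two copies of the question text.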

The Context
##### - The Context
# This is where all of the questions and answers are derived from.
# Each record of data will need to access it.
# We want to do this without any unnecessary duplication.

context = "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."
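Because every answer records where it starts in the context, a single shared copy of the context is enough to recover the answer text. The sketch below checks this on the first sentence of the context only; `answer_span` is a hypothetical helper name, not part of DataCake.

```python
# First sentence of the context above, kept as one shared string.
context = (
    "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) "
    "were the people who in the 10th and 11th centuries gave their name "
    "to Normandy, a region in France."
)


def answer_span(context: str, answer: dict) -> str:
    """Slice the answer text out of the context using its start index."""
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])]


france = answer_span(context, {"text": "France", "answer_start": 159})
centuries = answer_span(context, {"text": "10th and 11th centuries", "answer_start": 94})
```

Both slices reproduce the stored answer text, which is why records can point at the context instead of each carrying its own copy.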


The Ingredients

Need to write something.


Flattening
##### - Ingredients After Flattening
# This is a sample of the special keys from the sample data.
# Ingredients begin to take shape after the first flattening step.
# The integer values represent indexes as they are derived from the data.
# The string values represent the features derived from the data.
# Each data value has a tuple of these ingredient keys.
data = { "data": [{
  "qas": [{
    "question": "In what country is Normandy located?",  # ((0, "data"), (0, "qas"), (0, "question"))
    "answers": [{
      "text": "France",  # ((0, "data"), (0, "qas"), (0, "answers"), (0, "text"))
      "answer_start": 159  # ((0, "data"), (0, "qas"), (0, "answers"), (0, "answer_start"))
    }],
    "is_impossible": False  # ((0, "data"), (0, "qas"), (0, "is_impossible"))
  }, {
    "question": "When were the Normans in Normandy?",
    "answers": [{
      "text": "10th and 11th centuries",
      "answer_start": 94
    }, {
      "text": "in the 10th and 11th centuries",
      "answer_start": 87
    }],
    "is_impossible": False
  }, {
    "question": "From which countries did the Norse originate?",
    "answers": [{
      "text": "Denmark, Iceland and Norway",
      "answer_start": 256
    }],
    "is_impossible": False
  }, {
    "question": "Who was the Norse leader?",
    "answers": [{
      "text": "Rollo",
      "answer_start": 308
    }],
    "is_impossible": False
  }, {
    "question": "What century did the Normans first gain their separate identity?",
    "answers": [{
      "text": "10th century",
      "answer_start": 671
    }, {
      "text": "the first half of the 10th century",
      "answer_start": 649
    }],
    "is_impossible": False
  }, {
    "plausible_answers": [{
      "text": "Normans",
      "answer_start": 4
    }],
    "question": "Who gave their name to Normandy in the 1000's and 1100's",
    "answers": [],
    "is_impossible": True
  }, {
    "plausible_answers": [{
      "text": "Normandy",
      "answer_start": 137
    }],
    "question": "What is France a region of?",
    "answers": [],
    "is_impossible": True
  }],
  "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."  # ((0, "data"), (-1, "context"))
}]}
ingredients: list[tuple[tuple[int, str], ...]] = [
  ((-1, "context"),),
  ((0, "paragraphs"), (0, "questions"), (0, "q")),
  ((0, "paragraphs"), (0, "questions"), (0, "a")),
  ((0, "paragraphs"), (0, "questions"), (1, "q")),
  ((0, "paragraphs"), (0, "questions"), (1, "a")),
  ((0, "paragraphs"), (1, "questions"), (0, "q")),
  ((0, "paragraphs"), (1, "questions"), (0, "a")),
  ((0, "paragraphs"), (1, "questions"), (1, "q")),
  ((0, "paragraphs"), (1, "questions"), (1, "a")),
  ((0, "paragraphs"), (2, "questions"), (0, "q")),
  ((0, "paragraphs"), (2, "questions"), (0, "a")),
  ((0, "paragraphs"), (2, "questions"), (1, "q")),
  ((0, "paragraphs"), (2, "questions"), (1, "a")),
]
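To make the idea of ingredient keys concrete, here is a minimal recursive sketch that walks nested data and yields one `(index, key)` pair per nesting level for every leaf value. It is not DataCake's implementation. The convention is an assumption made for this sketch: the index is the position inside a list, and `-1` marks any value that is not a list element, so scalar fields such as `"question"` get `-1` here even though the annotations above show `0`. `flatten` and `Path` are hypothetical names.

```python
from typing import Any, Iterator

# One (index, key) pair per nesting level; -1 marks a value that is not a list element.
Path = tuple[tuple[int, str], ...]


def flatten(node: Any, path: Path = ()) -> Iterator[tuple[Path, Any]]:
    """Yield (path, leaf) pairs for every leaf in nested dicts and lists."""
    if not isinstance(node, dict):
        yield path, node  # scalars, empty lists, and non-dict list items are leaves
        return
    for key, value in node.items():
        if isinstance(value, list) and value:
            for index, item in enumerate(value):
                yield from flatten(item, path + ((index, key),))
        else:
            # Scalars, nested dicts, and empty lists keep the -1 marker.
            yield from flatten(value, path + ((-1, key),))


sample = {"data": [{
    "qas": [{
        "question": "In what country is Normandy located?",
        "answers": [{"text": "France", "answer_start": 159}],
        "is_impossible": False,
    }],
    "context": "(shared context string)",
}]}

flat = dict(flatten(sample))
```

Every leaf ends up under a unique path, for example `((0, "data"), (-1, "context"))` for the context, so the nested structure can be rebuilt from the keys and no value is lost or copied.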



https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4216

  • Motivation from NIH Preliminary Data Post-Submission Update

  • Challenges

    1. Skewness
      • Load Balancing Problem
      • Not Very Well Explored
    2. Programming Mismatch
    3. Information Loss
    4. Partitioning
      • Data Distribution Problem
      • Data Duplication Problem
  • Solutions

    1. Shredding
    2. Flatten
      • Map
      • FlatMap
      • Filter
  • New Approach: Index Bucketing

    • Overview
      1. Trees
      2. Branches
      3. Leaves
  • Setup

    1. Nested Data to Tree
    2. Branches know their leaves
    3. Leaves know their branches
  • Flattening Evaluations

    • Index Bucketing
    • Recursive Mapping
    • Pandas Explosion
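The Map, FlatMap, and Filter steps listed under Flatten can be illustrated with plain Python. This sketches the conventional approach only, not Index Bucketing; `flat_map` and `explode` are hypothetical helper names. Note how the question text is copied once per answer, which is the data duplication problem named in the outline.

```python
from itertools import chain
from typing import Callable, Iterable

# Shortened copy of two sample records.
qas = [
    {"question": "When were the Normans in Normandy?",
     "answers": [{"text": "10th and 11th centuries"},
                 {"text": "in the 10th and 11th centuries"}],
     "is_impossible": False},
    {"question": "What is France a region of?",
     "answers": [],
     "is_impossible": True},
]


def flat_map(func: Callable[[dict], list[dict]], items: Iterable[dict]) -> list[dict]:
    """Map func over items, then concatenate the resulting lists into one."""
    return list(chain.from_iterable(map(func, items)))


def explode(qa: dict) -> list[dict]:
    """Turn one question record into one flat row per answer."""
    return [{"question": qa["question"], "answer": a["text"]} for a in qa["answers"]]


# Filter: keep answerable questions. FlatMap: one row per (question, answer) pair.
possible = filter(lambda qa: not qa["is_impossible"], qas)
rows = flat_map(explode, possible)
```

The result has two rows that both repeat the same question string, so the flat table is larger than the nested source even though it carries no extra information.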

