
Utility library for data analysis.


DataCake


Table of Contents

Introduction

  1. Features
  2. Motivation

Data

  1. A Sample
  2. Some Questions
  3. Some Answers
  4. The Context

Ingredients

  1. Flattening
  2. Scattering
  3. Spattering

Algorithms

  1. Index Bucketing

Cake


An Introduction

Features Checklist


  • [ ] 1. Flattens Deeply Nested Data

"How flat are we talking? It'll make your data flatter than a pancake!"


  • [ ] 2. Without Unnecessarily Duplicating Data

"So, I get the whole cake and nothing but the cake? No more and no less!"


  • [ ] 3. With No Loss of Information

"You can have your cake and eat it too? Every bit of it!"


  • [ ] 4. Using MongoDB-Style Syntax

"Is it a piece of cake? You can bet your buns!"


  • [ ] 5. Integrated with Numpy and Numba

"Is that a cherry on top? Why yes it is!"


The Data

To illustrate DataCake's features, I'll use a small sample from the [SQuAD][1] dataset.


A Sample
#####
# Simplified Sample of the SQuAD Dataset
# - Don't worry about analyzing this too much
# - We will break it down step-by-step
#####
data: dict = {
  "qas": [{
    "question": "In what country is Normandy located?",
    "answers": [{
      "text": "France",
      "answer_start": 159
    }],
    "is_impossible": False
  }, {
    "question": "When were the Normans in Normandy?",
    "answers": [{
      "text": "10th and 11th centuries",
      "answer_start": 94
    }, {
      "text": "in the 10th and 11th centuries",
      "answer_start": 87
    }],
    "is_impossible": False
  }, {
    "question": "From which countries did the Norse originate?",
    "answers": [{
      "text": "Denmark, Iceland and Norway",
      "answer_start": 256
    }],
    "is_impossible": False
  }, {
    "question": "Who was the Norse leader?",
    "answers": [{
      "text": "Rollo",
      "answer_start": 308
    }],
    "is_impossible": False
  }, {
    "question": "What century did the Normans first gain their separate identity?",
    "answers": [{
      "text": "10th century",
      "answer_start": 671
    }, {
      "text": "the first half of the 10th century",
      "answer_start": 649
    }],
    "is_impossible": False
  }, {
    "plausible_answers": [{
      "text": "Normans",
      "answer_start": 4
    }],
    "question": "Who gave their name to Normandy in the 1000's and 1100's",
    "answers": [],
    "is_impossible": True
  }, {
    "plausible_answers": [{
      "text": "Normandy",
      "answer_start": 137
    }],
    "question": "What is France a region of?",
    "answers": [],
    "is_impossible": True
  }],
  "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."
}

Some Questions
#####
# List of Sample Questions and If Answering is Impossible
# - For our needs, impossible questions are not desirable
#####
questions = [{
  "question": "In what country is Normandy located?",
  "is_impossible": False
}, {
  "question": "When were the Normans in Normandy?",
  "is_impossible": False
}, {
  "question": "From which countries did the Norse originate?",
  "is_impossible": False
}, {
  "question": "Who was the Norse leader?",
  "is_impossible": False
}, {
  "question": "What century did the Normans first gain their separate identity?",
  "is_impossible": False
}, {
  "question": "Who gave their name to Normandy in the 1000's and 1100's?",
  "is_impossible": True
}, {
  "question": "What is France a region of?",
  "is_impossible": True
}]

Some Answers
##### - Answers
# Here we have lists of answers from the possible questions.
# Note that some questions have multiple correct answers.
# Each answer also has its beginning index found in the context.
# We want each of these answers, but we can get rid of the indexes.
# Each answer needs to be associated with its appropriate question.
# This needs to be done without any unnecessary duplication.

answers = [
  # In what country is Normandy located?
  [{
    "text": "France",
    "answer_start": 159
  }],
  # When were the Normans in Normandy?
  [{
    "text": "10th and 11th centuries",
    "answer_start": 94
  }, {
    "text": "in the 10th and 11th centuries",
    "answer_start": 87
  }],
  # From which countries did the Norse originate?
  [{
    "text": "Denmark, Iceland and Norway",
    "answer_start": 256
  }],
  # Who was the Norse leader?
  [{
    "text": "Rollo",
    "answer_start": 308
  }],
  # What century did the Normans first gain their separate identity?
  [{
    "text": "10th century",
    "answer_start": 671
  }, {
    "text": "the first half of the 10th century",
    "answer_start": 649
  }]
]
##### - Plausible Answers
# These are the plausible answers given with the impossible questions.
# They do not adequately answer their questions.
# We only want good answers extracted from the context.

plausible = [
  # Who gave their name to Normandy in the 1000's and 1100's?
  [{
    "text": "Normans",
    "answer_start": 4
  }],
  # What is France a region of?
  [{
    "text": "Normandy",
    "answer_start": 137
  }]
]
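The goals listed in the comments above (keep every good answer, drop the indexes, tie each answer to its question, avoid duplication) can be sketched in plain Python. This is only an illustration and not DataCake's API: the names `possible` and `rows` are hypothetical, the lists are a shortened copy of the sample, and it assumes the answer lists appear in the same order as the answerable questions, as they do above.

```python
# Shortened copy of the sample lists (two answerable questions, one impossible).
questions = [
    {"question": "In what country is Normandy located?", "is_impossible": False},
    {"question": "When were the Normans in Normandy?", "is_impossible": False},
    {"question": "What is France a region of?", "is_impossible": True},
]
answers = [
    [{"text": "France", "answer_start": 159}],
    [{"text": "10th and 11th centuries", "answer_start": 94},
     {"text": "in the 10th and 11th centuries", "answer_start": 87}],
]

# Keep only the answerable questions, in order, so they line up with `answers`.
possible = [q["question"] for q in questions if not q["is_impossible"]]

# One (question_index, answer_text) row per answer.
# The question text is stored once in `possible` and referenced by index,
# and the answer_start offsets are dropped.
rows = [
    (q_index, answer["text"])
    for q_index, answer_list in enumerate(answers)
    for answer in answer_list
]

for q_index, text in rows:
    print(possible[q_index], "->", text)
```

Referring to a question by its index rather than copying its text into every row is one simple way to avoid the duplication mentioned above.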

The Context
##### - The Context
# This is where all of the questions and answers are derived from.
# Each record of data will need to access it.
# We want to do this without any unnecessary duplication.

context = "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."
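Since each answer carries the index where it begins in the context, the two can be checked against each other. The sketch below is a hedged illustration, not part of DataCake: `matches_context` is a hypothetical helper, and only the first sentence of the context is used because the three offsets shown fall inside it.

```python
# First sentence of the context only; the offsets below all fall inside it.
context = (
    "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) "
    "were the people who in the 10th and 11th centuries gave their name "
    "to Normandy, a region in France."
)

answers = [
    {"text": "France", "answer_start": 159},
    {"text": "10th and 11th centuries", "answer_start": 94},
    {"text": "in the 10th and 11th centuries", "answer_start": 87},
]


def matches_context(answer: dict, context: str) -> bool:
    """Return True if the answer text sits at answer_start in the context."""
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])] == answer["text"]


# One boolean per answer, in the same order as `answers`.
checks = [matches_context(answer, context) for answer in answers]
print(checks)
```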


The Ingredients

Need to write something.


Flattening
##### - Ingredients After Flattening
# This is a sample of the special keys from the sample data.
# Ingredients begin to take shape after the first flattening step.
# The integer values represent indexes as they are derived from the data.
# The string values represent the features derived from the data.
# Each data value has a tuple of these ingredient keys.
data = { "data": [{
  "qas": [{
    "question": "In what country is Normandy located?",  # ((0, "data"), (0, "qas"), (0, "question"))
    "answers": [{
      "text": "France",  # ((0, "data"), (0, "qas"), (0, "answers"), (0, "text"))
      "answer_start": 159  # ((0, "data"), (0, "qas"), (0, "answers"), (0, "answer_start"))
    }],
    "is_impossible": False  # ((0, "data"), (0, "qas"), (0, "is_impossible"))
  }, {
    "question": "When were the Normans in Normandy?",
    "answers": [{
      "text": "10th and 11th centuries",
      "answer_start": 94
    }, {
      "text": "in the 10th and 11th centuries",
      "answer_start": 87
    }],
    "is_impossible": False
  }, {
    "question": "From which countries did the Norse originate?",
    "answers": [{
      "text": "Denmark, Iceland and Norway",
      "answer_start": 256
    }],
    "is_impossible": False
  }, {
    "question": "Who was the Norse leader?",
    "answers": [{
      "text": "Rollo",
      "answer_start": 308
    }],
    "is_impossible": False
  }, {
    "question": "What century did the Normans first gain their separate identity?",
    "answers": [{
      "text": "10th century",
      "answer_start": 671
    }, {
      "text": "the first half of the 10th century",
      "answer_start": 649
    }],
    "is_impossible": False
  }, {
    "plausible_answers": [{
      "text": "Normans",
      "answer_start": 4
    }],
    "question": "Who gave their name to Normandy in the 1000's and 1100's",
    "answers": [],
    "is_impossible": True
  }, {
    "plausible_answers": [{
      "text": "Normandy",
      "answer_start": 137
    }],
    "question": "What is France a region of?",
    "answers": [],
    "is_impossible": True
  }],
  "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."  # ((0, "data"), (-1, "context"))
}]}
ingredients: list[tuple[tuple[int, str], ...]] = [
  ((-1, "context"),),
  ((0, "paragraphs"), (0, "questions"), (0, "q")),
  ((0, "paragraphs"), (0, "questions"), (0, "a")),
  ((0, "paragraphs"), (0, "questions"), (1, "q")),
  ((0, "paragraphs"), (0, "questions"), (1, "a")),
  ((0, "paragraphs"), (1, "questions"), (0, "q")),
  ((0, "paragraphs"), (1, "questions"), (0, "a")),
  ((0, "paragraphs"), (1, "questions"), (1, "q")),
  ((0, "paragraphs"), (1, "questions"), (1, "a")),
  ((0, "paragraphs"), (2, "questions"), (0, "q")),
  ((0, "paragraphs"), (2, "questions"), (0, "a")),
  ((0, "paragraphs"), (2, "questions"), (1, "q")),
  ((0, "paragraphs"), (2, "questions"), (1, "a")),
]
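One way to derive key tuples like the ones annotated above is a recursive walk that records an `(index, key)` pair at every level. The following is a minimal sketch under stated assumptions and not DataCake's implementation: `flatten` and `Path` are hypothetical names, and values that are not inside a list are given index 0 here, whereas the annotations above show both 0 and -1 for such values. Empty lists yield no rows, and lists nested directly inside lists are treated as leaves, so a lossless version would need extra handling.

```python
from typing import Any, Iterator

# A path is a tuple of (index, key) pairs, e.g. ((0, "data"), (0, "qas")).
Path = tuple[tuple[int, str], ...]


def flatten(node: Any, path: Path = ()) -> Iterator[tuple[Path, Any]]:
    """Yield (path, leaf) for every leaf reachable through nested dicts."""
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, list):
                # One path step per list element: (position in list, key).
                for index, item in enumerate(value):
                    yield from flatten(item, path + ((index, key),))
            else:
                # Assumption: values outside a list get index 0.
                yield from flatten(value, path + ((0, key),))
    else:
        # Anything that is not a dict is treated as a leaf value.
        yield path, node


# Shortened copy of the sample: one question with one answer.
sample = {"data": [{
    "qas": [{
        "question": "In what country is Normandy located?",
        "answers": [{"text": "France", "answer_start": 159}],
        "is_impossible": False,
    }],
    "context": "The Normans ...",
}]}

paths = dict(flatten(sample))
for path, value in paths.items():
    print(path, "=>", value)
```

Because each leaf keeps its full path, rows that share a prefix (for example two answers to the same question) can be grouped again later without copying the shared values into every row.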



https://dl.ucsc.cmb.ac.lk/jspui/handle/123456789/4216

  • Motivation from NIH Preliminary Data Post-Submission Update

  • Challenges

    1. Skewness
      • Load Balancing Problem
      • Not Very Well Explored
    2. Programming Mismatch
    3. Information Loss
    4. Partitioning
      • Data Distribution Problem
      • Data Duplication Problem
  • Solutions

    1. Shredding
    2. Flatten
      • Map
      • FlatMap
      • Filter
  • New Approach: Index Bucketing

    • Overview
      1. Trees
      2. Branches
      3. Leaves
  • Setup

    1. Nested Data to Tree
    2. Branches know their leaves
    3. Leaves know their branches
  • Flattening Evaluations

    • Index Bucketing
    • Recursive Mapping
    • Pandas Explosion
