Streaming big data analysis, probability prediction, and code knowledge assistant toolkit.

bigdata-kit

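bigdata-kit is a Python library that analyzes large JSON, JSONL, and CSV datasets in a streaming fashion, without loading them into memory all at once. It makes simple predictions from value frequencies and conditional probabilities, and it also offers programming assistance by accumulating functions, classes, code patterns, and error fixes as JSONL knowledge data.

This project is not a true deep-learning AI model. It is instead a big-data tool whose search, statistics, conditional probabilities, and pattern analysis become more accurate as more data accumulates.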

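pyproject.toml-based development

The project uses a src layout and a PEP 621 pyproject.toml.

Distribution name: bigdata-kit
Import package: bigdata_kit
CLI command: bigdata
Build backend: hatchling
Python: 3.10 or later

An alias package is also provided so that moduonbigdata, the import name from the initial request, keeps working.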

from bigdata_kit import DataAnalyzer
from moduonbigdata import DataAnalyzer as CompatDataAnalyzer

CLI usage

bigdata count examples/sample.jsonl
bigdata fields examples/sample.jsonl
bigdata sample examples/sample.jsonl --n 5
bigdata freq examples/sample.jsonl WinValue
bigdata stats examples/sample.jsonl Score
bigdata top examples/sample.jsonl Score --n 10
bigdata where examples/sample.jsonl "Score > 50"
bigdata predict examples/sample.jsonl WinValue
bigdata predict examples/sample.jsonl WinValue --where "Player=블레카,Map=sky"

CLI output is rendered as Rich tables. If the file does not exist or the condition expression is invalid, a Korean error message is printed.

Python API usage

from bigdata_kit import CodeAssistant, DataAnalyzer

analyzer = DataAnalyzer("examples/sample.jsonl")

print(analyzer.count())
print(analyzer.fields())
print(analyzer.sample(5))
print(analyzer.frequency("WinValue"))
print(analyzer.numeric_stats("Score"))
print(analyzer.top("Score", n=10))
print(analyzer.where("Score > 50"))
print(analyzer.predict("WinValue"))
print(analyzer.predict("WinValue", where={"Player": "블레카"}))

assistant = CodeAssistant("examples/code_knowledge.jsonl")

print(assistant.suggest_function("json 파일 읽기"))  # query: "read a json file"
print(assistant.explain_function("read_jsonl"))
print(assistant.fix_error("ModuleNotFoundError: No module named ijson"))
print(assistant.generate_example("jsonl 파일에서 승률 계산"))  # query: "compute win rate from a jsonl file"

JSONL data example

{"Player":"블레카","Enemy":"레카","Map":"sky","WinValue":"이김","Time":"밤","Score":95}
{"Player":"블레카","Enemy":"레카","Map":"desert","WinValue":"짐","Time":"낮","Score":40}

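JSONL stores one JSON object per line. Because read_jsonl() reads line by line, even large files can be processed sequentially. In the sample data, the WinValue values "이김" and "짐" mean "won" and "lost", and the Time values "밤" and "낮" mean "night" and "day".

CSV data example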
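
The line-by-line reading described above can be sketched as a generator. This is a minimal illustration, not the library's actual read_jsonl(); the name read_jsonl_sketch and the choice to skip blank lines are assumptions.

```python
import json
from typing import Iterator


def read_jsonl_sketch(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        # Iterating over the file object reads one line at a time.
        for line in handle:
            line = line.strip()
            if line:  # assumption: blank lines are skipped, not errors
                yield json.loads(line)
```

Because the function is a generator, a caller that only counts rows or updates a frequency table holds a single row in memory at any moment.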

Player,Enemy,Map,WinValue,Time,Score
블레카,레카,sky,이김,밤,95
블레카,레카,desert,짐,낮,40

CSV files are read chunk by chunk using the pandas chunksize option, and the rows are ultimately yielded one at a time as dicts.

Probability-based prediction
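
A chunked CSV reader along these lines might look like the sketch below. It assumes pandas is installed; the function name read_csv_rows_sketch and the default chunk size are hypothetical, not taken from the library.

```python
from typing import Iterator

import pandas as pd


def read_csv_rows_sketch(path: str, chunksize: int = 10_000) -> Iterator[dict]:
    """Yield row dicts while holding at most one chunk in memory."""
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of one DataFrame for the whole file.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        for row in chunk.to_dict(orient="records"):
            yield row
```

The chunk size trades memory for speed: larger chunks mean fewer parser calls but a bigger temporary DataFrame.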

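DataAnalyzer.predict(target) ranks the values of the target field by probability, based on their overall frequencies.

DataAnalyzer.predict(target, where={...}) first collects only the records that exactly match the conditions and computes conditional frequencies. If too few records match, it estimates the influence of each condition with a smoothed categorical Naive Bayes method.

The return structure is as follows.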

{
    "target": "WinValue",
    "prediction": "이김",
    "confidence": 0.72,
    "method": "conditional_frequency",
    "matched_records": 153,
    "total_records": 1000,
    "ranking": [
        {"value": "이김", "probability": 0.72, "count": 110},
        {"value": "짐", "probability": 0.28, "count": 43},
    ],
}
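
The two-stage strategy can be sketched in one streaming pass. This is an illustration under stated assumptions, not the library's implementation: the threshold min_matches, the smoothing constant alpha, and the method labels "frequency" and "naive_bayes" are hypothetical (only "conditional_frequency" appears in the structure above).

```python
from collections import Counter, defaultdict


def predict_sketch(rows, target, where=None, min_matches=5, alpha=1.0):
    """Rank target values by frequency, falling back to smoothed Naive Bayes."""
    where = where or {}
    overall = Counter()             # label -> rows with that label
    matched = Counter()             # label -> rows matching every condition
    joint = Counter()               # (field, value, label) -> rows
    field_label = Counter()         # (field, label) -> rows that have the field
    seen_values = defaultdict(set)  # field -> distinct values observed

    for row in rows:  # one pass, so rows may be a generator
        if target not in row:
            continue
        label = row[target]
        overall[label] += 1
        for field in where:
            if field in row:
                joint[(field, row[field], label)] += 1
                field_label[(field, label)] += 1
                seen_values[field].add(row[field])
        # With no conditions, all() is True and every row counts as matched.
        if all(row.get(f) == v for f, v in where.items()):
            matched[label] += 1

    total = sum(overall.values())
    if total == 0:
        raise ValueError(f"no records contain the field {target!r}")
    n_matched = sum(matched.values())

    if not where:
        method, weights = "frequency", dict(matched)
    elif n_matched >= min_matches:
        method, weights = "conditional_frequency", dict(matched)
    else:
        method, weights = "naive_bayes", {}
        for label, count in overall.items():
            weight = count / total  # prior P(label)
            for field, value in where.items():
                # Add-alpha smoothing keeps unseen combinations above zero.
                n_values = len(seen_values[field] | {value})
                weight *= (joint[(field, value, label)] + alpha) / (
                    field_label[(field, label)] + alpha * n_values
                )
            weights[label] = weight

    norm = sum(weights.values())
    ranking = sorted(
        (
            {"value": label, "probability": w / norm, "count": matched[label]}
            for label, w in weights.items()
        ),
        key=lambda item: item["probability"],
        reverse=True,
    )
    return {
        "target": target,
        "prediction": ranking[0]["value"],
        "confidence": ranking[0]["probability"],
        "method": method,
        "matched_records": n_matched,
        "total_records": total,
        "ranking": ranking,
    }
```

In this sketch the Naive Bayes branch treats each condition as independent given the target value, so it can still produce a ranking when no single record matches every condition at once.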

CodeAssistant features

CodeAssistant searches a JSONL file that stores functions, classes, code patterns, and error fixes. It computes relevance from name, description, tags, example, error message, and usage_count, with no embedding model required.

bigdata code suggest examples/code_knowledge.jsonl "json 파일 읽기"  # query: "read a json file"
bigdata code explain examples/code_knowledge.jsonl read_jsonl
bigdata code patterns examples/code_knowledge.jsonl "json 분석"  # query: "json analysis"
bigdata code fix examples/code_knowledge.jsonl "ModuleNotFoundError: No module named ijson"
bigdata code example examples/code_knowledge.jsonl "jsonl 파일에서 승률 계산"  # query: "compute win rate from a jsonl file"
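
Embedding-free relevance scoring of this kind can be approximated with weighted token overlap. The sketch below is one plausible scheme, not the scoring CodeAssistant actually uses; the field weights, the usage_count bonus, and the function names are all assumptions.

```python
import math
import re

# Hypothetical weights: a hit in the name counts more than one in the example.
FIELD_WEIGHTS = {
    "name": 3.0,
    "tags": 2.0,
    "error": 2.0,
    "message": 2.0,
    "description": 1.5,
    "example": 1.0,
}


def tokenize(text):
    """Lowercase and split on non-alphanumerics (underscores included)."""
    return set(re.findall(r"[^\W_]+", text.lower()))


def score_record(query, record):
    """Sum weighted token overlaps, then add a small popularity bonus."""
    query_tokens = tokenize(query)
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        value = record.get(field, "")
        if isinstance(value, list):  # tags are stored as a list of strings
            value = " ".join(str(item) for item in value)
        score += weight * len(query_tokens & tokenize(str(value)))
    if score > 0:
        # usage_count only breaks ties among records that already match.
        score += 0.1 * math.log1p(record.get("usage_count", 0))
    return score


def search(query, records, limit=3):
    """Return up to `limit` matching records, best score first."""
    scored = [(score_record(query, record), record) for record in records]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:limit]]
```

Splitting on underscores lets a query such as "jsonl" match the name read_jsonl, and the Unicode-aware pattern also tokenizes Korean queries.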

Python project indexing

The AST-based indexer can turn the functions and classes of a Python project into a JSONL knowledge dataset.

bigdata code index ./src --output examples/code_knowledge.jsonl

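The indexer extracts the following information.

Function name, docstring, parameters, return type
Class name, docstring, method list
Import list
Default tags used for search

Code knowledge dataset structure

Function record: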

{
  "type": "function",
  "language": "python",
  "name": "read_jsonl",
  "description": "Function that reads a JSONL file line by line",
  "parameters": [
    { "name": "path", "type": "str", "description": "Path of the file to read" }
  ],
  "returns": "Iterator[dict]",
  "tags": ["json", "jsonl", "streaming"],
  "example": "for row in read_jsonl('data.jsonl'):\n    print(row)",
  "usage_count": 10
}

Error solution record:

{
  "type": "error_solution",
  "language": "python",
  "error": "ModuleNotFoundError",
  "message": "No module named ijson",
  "cause": "The ijson package is not installed.",
  "solution": "pip install ijson",
  "tags": ["python", "package", "install", "ijson"],
  "usage_count": 5
}

Running tests

pytest

Running ruff/mypy

ruff check .
mypy src

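Future extension ideas

Parquet reader
Local analysis engine based on SQLite/DuckDB
Search with SQLite FTS or DuckDB FTS
Semantic search based on sentence-transformers or OpenAI embeddings
Caching of large-scale prediction results
AND/OR parser for multi-condition expressions
HTML data profile report generation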

Release history

0.1.1 (this version): May 30, 2026
0.1.0: May 30, 2026

Download files

Source Distribution

bigdata_kit-0.1.1.tar.gz (20.4 kB)

Uploaded May 30, 2026 Source

Built Distribution

bigdata_kit-0.1.1-py3-none-any.whl (25.7 kB)

Uploaded May 30, 2026 Python 3

File details

Details for the file bigdata_kit-0.1.1.tar.gz.

File metadata

Download URL: bigdata_kit-0.1.1.tar.gz
Upload date: May 30, 2026
Size: 20.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for bigdata_kit-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`8b6d60cd136cfc0a9b9c4f4a525cbaa509ae5c0f36d368538a0c3b4dd29f4024`
MD5	`921d17d89163eaa26de9b6fd08d41cf1`
BLAKE2b-256	`af8ee589e140e22b42f1e0d7e678948ed2a5e323dadffa971f844f056e821ef4`

File details

Details for the file bigdata_kit-0.1.1-py3-none-any.whl.

File metadata

Download URL: bigdata_kit-0.1.1-py3-none-any.whl
Upload date: May 30, 2026
Size: 25.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for bigdata_kit-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7d2389de0199d77ef0a843fe94c38e02ec770166516d42450caef1674836f781`
MD5	`9f17d59ebc806f3b3117a129807c6955`
BLAKE2b-256	`7db3b1a16a027dcf0b9f2f20fef9567c38b22c6c4558873b39baf8292b5800e8`
