A library for cleaning up and visualizing datasets

clean-dataset

A tool for organizing and visualizing video/image datasets that are split into train/val/test.

Installation

pip install clean-dataset

Usage

from clean_dataset import image, split, mask_semantic, mask_instance, mask_panoptic, make_json, visualize

1. Flattening the image/mask folder structure

# JPEGImages/video_id/frame.jpg -> images/video_id@frame.jpg
image("dataset")

# video_id/image/frame.jpg -> images/video_id@frame.jpg
image("dataset", source="image")

# video_id/mask/frame.png -> masks/video_id@frame.png
image("dataset", source="mask")

# Process only specific splits
image("dataset", splits=["train", "val", "test"])
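The renaming rule in the comments above (`video_id/frame` becomes `video_id@frame`) can be illustrated with a small standalone sketch. This is not the library's implementation; `flatten_name` and its parameters are hypothetical names chosen for illustration, and the sketch only computes the target name without moving any files.

```python
from pathlib import PurePosixPath


def flatten_name(relative_path: str, source: str = "JPEGImages",
                 target_dir: str = "images") -> str:
    """Return the flattened name for one frame (hypothetical helper).

    Handles both layouts shown above by dropping the `source` folder
    component and joining what remains as video_id@frame.
    """
    parts = PurePosixPath(relative_path).parts
    kept = [part for part in parts if part != source]
    if len(kept) != 2:
        raise ValueError(f"expected video_id and frame in {relative_path!r}")
    video_id, frame = kept
    return f"{target_dir}/{video_id}@{frame}"


print(flatten_name("JPEGImages/0a1b/00000.jpg"))
print(flatten_name("0a1b/mask/00000.png", source="mask", target_dir="masks"))
```

Joining with `@` keeps the video ID recoverable from the flat file name, as long as video IDs themselves do not contain `@`.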

2. Splitting the dataset (based on txt files)

split("data", train="train.txt", val="val.txt", test="test.txt")
# Result: train/, val/, and test/ folders are created
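The README does not document the layout of the txt files. As a rough sketch, assuming each file lists one item name (for example a video ID) per line, the assignment of items to splits could look like the following. `parse_split_file` and `assign_splits` are hypothetical helpers, not part of clean-dataset.

```python
def parse_split_file(text: str) -> list[str]:
    """Return non-empty, stripped lines (assumed: one item name per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def assign_splits(split_texts: dict[str, str]) -> dict[str, str]:
    """Map every listed item to its split, rejecting items listed twice."""
    assignment: dict[str, str] = {}
    for split_name, text in split_texts.items():
        for item in parse_split_file(text):
            if item in assignment:
                raise ValueError(
                    f"{item!r} is listed in both {assignment[item]!r} and {split_name!r}"
                )
            assignment[item] = split_name
    return assignment


print(assign_splits({"train": "video_a\nvideo_b\n", "val": "video_c\n"}))
```

Rejecting duplicates up front is a design choice in this sketch: an item that appears in two lists would otherwise leak between train and evaluation data.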

3. Generating masks

# Generate from annotation JSON
mask_semantic("dataset", {
    "train": "train.json",
    "val": "val.json",
    "test": "test.json",  # include test if you have it, otherwise omit it
})
mask_instance("dataset", {"train": "train.json", "val": "val.json"})
mask_panoptic("dataset", {"train": "train.json", "val": "val.json"})
# Result: dataset/train/mask_semantic/, dataset/train/mask_instance/, ...

# Generate from RGB mask folders
# Layout 1: split/panomasksRGB/video_id/*.png
mask_instance("dataset", {
    "train": "panomasksRGB",
    "val": "panomasksRGB",
}, from_rgb=True)

# Layout 2: split/data/video_id/mask/*.png
mask_instance("dataset", {
    "train": "data",
    "val": "data",
}, from_rgb=True, mask_folder="mask")
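How the RGB colors are turned into IDs is not described here. One widely used convention, from the COCO panoptic format, packs a segment ID into the three channels as `R + 256*G + 256*256*B`. The sketch below decodes that convention on plain nested lists; whether clean-dataset uses the same encoding is an assumption, and `rgb_to_id` and `rgb_mask_to_ids` are hypothetical names.

```python
def rgb_to_id(r: int, g: int, b: int) -> int:
    """Decode one pixel using the COCO panoptic packing (assumed encoding)."""
    return r + 256 * g + 256 * 256 * b


def rgb_mask_to_ids(rgb_rows):
    """Convert rows of (R, G, B) tuples into rows of integer segment IDs."""
    return [[rgb_to_id(*pixel) for pixel in row] for row in rgb_rows]


mask = [
    [(0, 0, 0), (2, 1, 0)],
    [(2, 1, 0), (0, 0, 1)],
]
print(rgb_mask_to_ids(mask))
```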

4. Generating JSON

# Generate bbox JSON from annotation JSON
make_json("dataset", {
    "train": ["train/instances.json"],
    "val": ["val/instances.json"],
    "test": ["test/instances.json"],
})
# Result: dataset/train/json/, dataset/val/json/

# Compute bboxes directly from masks
make_json("dataset", {
    "train": "mask_instance",
    "val": "mask_instance",
}, from_mask=True)

# Generate category JSON (extracted from the annotations)
make_json("dataset", {
    "train": ["train.json"],
}, category=True, output_name="VIPSeg.json")

Output JSON format (bbox):

[
    {"id": 1, "tag": "person", "xmin": 701, "ymin": 0, "xmax": 1280, "ymax": 709},
    {"id": 2, "tag": "car", "xmin": 100, "ymin": 200, "xmax": 500, "ymax": 600}
]
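To make the bbox records above concrete, here is a minimal sketch of deriving them from an instance-ID mask, in the spirit of the `from_mask=True` option. It is not the library's code: `bboxes_from_mask` and `id_to_tag` are hypothetical, ID 0 is assumed to be background, and the coordinates are assumed to be inclusive pixel indices (the library may use a different convention).

```python
def bboxes_from_mask(mask, id_to_tag, background=0):
    """Compute one bbox record per instance ID in a 2D list of IDs.

    Assumptions (not confirmed by the README): `background` pixels are
    skipped, and xmax/ymax are inclusive pixel indices.
    """
    boxes = {}
    for y, row in enumerate(mask):
        for x, instance_id in enumerate(row):
            if instance_id == background:
                continue
            if instance_id not in boxes:
                boxes[instance_id] = [x, y, x, y]
            else:
                box = boxes[instance_id]
                box[0] = min(box[0], x)
                box[1] = min(box[1], y)
                box[2] = max(box[2], x)
                box[3] = max(box[3], y)
    return [
        {"id": instance_id, "tag": id_to_tag.get(instance_id, "unknown"),
         "xmin": box[0], "ymin": box[1], "xmax": box[2], "ymax": box[3]}
        for instance_id, box in sorted(boxes.items())
    ]


mask = [
    [0, 1, 1],
    [0, 1, 2],
    [2, 2, 2],
]
for record in bboxes_from_mask(mask, {1: "person", 2: "car"}):
    print(record)
```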

Output JSON format (category, when isthing is present):

{
    "names": {"person": "thing", "sky": "stuff", ...},
    "ignore": 255
}

Output JSON format (category, when isthing is absent):

{
    "names": ["person", "car", "tree", ...],
    "ignore": 255
}
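Because `names` is an object in one variant and an array in the other, code that consumes the category JSON may want to normalize both into a single shape. The following is a sketch under that assumption; `load_categories` is a hypothetical helper, and using `None` for an unknown thing/stuff kind is a choice made here, not something the library defines.

```python
import json


def load_categories(text: str):
    """Parse category JSON and return (name -> kind, ignore value).

    `kind` is "thing" or "stuff" when the file carries that information,
    and None when `names` is a plain list.
    """
    data = json.loads(text)
    names = data["names"]
    if isinstance(names, dict):
        kinds = dict(names)
    else:
        kinds = {name: None for name in names}
    return kinds, data.get("ignore")


with_kinds = '{"names": {"person": "thing", "sky": "stuff"}, "ignore": 255}'
without_kinds = '{"names": ["person", "car"], "ignore": 255}'
print(load_categories(with_kinds))
print(load_categories(without_kinds))
```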

5. Visualization

visualize("dataset")
# Result: examples/semantic/, examples/instance/, examples/panoptic/

visualize("dataset", splits=["train", "val", "test"], max_samples=10)

Supported formats

  • Input: JSON in YouTube-VOS, VIPSeg, or COCO format
  • Masks: RLE encoding, RGB panoptic masks
  • Folder layouts:
    • split/JPEGImages/video_id/frame.jpg
    • split/video_id/image/frame.jpg
    • split/video_id/mask/frame.png
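The list above mentions RLE-encoded masks without saying which variant. As an illustration, the sketch below decodes COCO's uncompressed RLE, where `counts` alternates run lengths of 0s and 1s (starting with 0s) over the pixels in column-major order. That this is the variant clean-dataset reads is an assumption, and `decode_rle` is a hypothetical name.

```python
def decode_rle(counts, height, width):
    """Decode COCO-style uncompressed RLE into rows of 0/1 values.

    Runs alternate between 0 and 1, starting with 0, and pixels are
    ordered column by column (column-major).
    """
    if sum(counts) != height * width:
        raise ValueError("run lengths do not cover the mask exactly")
    flat = []
    value = 0
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    # Pixel (x, y) sits at index x * height + y in column-major order.
    return [[flat[x * height + y] for x in range(width)] for y in range(height)]


for row in decode_rle([1, 2, 3], height=2, width=3):
    print(row)
```

COCO annotations may also store a compressed string form of `counts`; decoding that form is usually left to `pycocotools` rather than done by hand.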

Dependencies

  • numpy >= 1.20.0
  • Pillow >= 8.0.0
  • opencv-python >= 4.5.0
  • sanghyunjo

License

MIT License

Download files

Source Distribution

clean_dataset-0.3.0.tar.gz (28.9 kB)

Uploaded Source

Built Distribution

clean_dataset-0.3.0-py3-none-any.whl (37.2 kB)

Uploaded Python 3

File details

Details for the file clean_dataset-0.3.0.tar.gz.

File metadata

  • Download URL: clean_dataset-0.3.0.tar.gz
  • Size: 28.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for clean_dataset-0.3.0.tar.gz
Algorithm    Hash digest
SHA256       2c7d6214d3a865c9fa1da91b2a98bccc5f18b781d1a71e0a58c392ef232bf674
MD5          be62f8f95378b2da536881d3d1af20ac
BLAKE2b-256  81bb2d0d240362a50d15f568dbb9a2e8ae09ded7944628d2984fb7236dc970fd

File details

Details for the file clean_dataset-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: clean_dataset-0.3.0-py3-none-any.whl
  • Size: 37.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for clean_dataset-0.3.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       4a067346bc6000f5bc8090476c98082c1c5e15ab46409627cd8ed504861a7979
MD5          aa9f43d45ca559dcb1a81da91dea304e
BLAKE2b-256  c3b6c9979b040b2adaa5eee2fef05e80d6d32332a9312f23fa95fc3df9d84ca3
