A pathlib.Path class for dapla

dapla-path

pathlib.Path for dapla

Created by: ort ort@ssb.no


Path (dapla)

import dapla as dp
import pandas as pd

from daplapath.path import Path
folder = Path('ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024')
folder
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024'

Works like a string

folder.startswith("ssb")
True
dp.FileClient.get_gcs_file_system().exists(folder)
True

With methods and attributes à la pathlib.Path

folder.exists()
True
folder.is_dir()
True
file = folder / "ABAS_kommune_utenhav_p2024_v1.parquet"
file
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024/ABAS_kommune_utenhav_p2024_v1.parquet'
file.parent
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024'
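The behaviour shown so far (a path that passes for a string but also supports `/` and `.parent`) can be approximated by subclassing `str`. The following is a minimal sketch under that assumption, not daplapath's actual implementation; the class name `StrPath` and the example bucket path are hypothetical.

```python
import posixpath


class StrPath(str):
    """Hypothetical str subclass with a few pathlib-like helpers."""

    def __truediv__(self, other: str) -> "StrPath":
        # Bucket-style paths always use forward slashes.
        return type(self)(posixpath.join(self, str(other)))

    @property
    def parent(self) -> "StrPath":
        # Everything before the last slash, still as a StrPath.
        return type(self)(posixpath.dirname(self))

    @property
    def name(self) -> str:
        return posixpath.basename(self)


folder = StrPath("bucket/data/2024")  # hypothetical path
file = folder / "table_p2024_v1.parquet"

print(file)                       # usable anywhere a plain str is expected
print(file.startswith("bucket"))  # str methods still work
print(file.parent == folder)
```

Because the object is a real `str`, it can be handed to any function that expects a plain path string, which is presumably why the `FileClient` call above accepts it directly.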

And some pandas attributes

Without reading the file

file.columns
Index(['OBJTYPE', 'NAVN', 'KOMMUNENR', 'FYLKE', 'AREAL_GDB', 'SHAPE_Length',
       'SHAPE_Area', 'geometry'],
      dtype='object')
file.dtypes
OBJTYPE         string
NAVN            string
KOMMUNENR       string
FYLKE           string
AREAL_GDB       double
SHAPE_Length    double
SHAPE_Area      double
geometry        binary
dtype: object
file.shape
(481, 8)

Versioning

file.version_number
1
print(file.versions())
timestamp            mb (int)
2024-05-19 12:31:02  941            .../ABAS_kommune_utenhav_p2024.parquet
2024-08-16 16:15:10  941         .../ABAS_kommune_utenhav_p2024_v1.parquet
Name: path, dtype: object
file.latest_version()
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024/ABAS_kommune_utenhav_p2024_v1.parquet'
file.highest_numbered_version()
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024/ABAS_kommune_utenhav_p2024_v1.parquet'
# highest_numbered_version + 1
file.new_version()
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024/ABAS_kommune_utenhav_p2024_v2.parquet'
# always False
file.new_version().exists()
False
# finds/removes the version number with a regex search
file._version_pattern
'_v(\\d+)'
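To make the role of the pattern above concrete, here is a small sketch of how a `_v(\d+)` regex can be used to read, replace and strip a version number in a path. The helper names are hypothetical and are not part of daplapath; the file name is shortened for the example.

```python
import re
from typing import Optional

VERSION_PATTERN = r"_v(\d+)"  # same pattern as file._version_pattern


def version_number(path: str) -> Optional[int]:
    """Return the version number found in the path, or None if there is none."""
    match = re.search(VERSION_PATTERN, path)
    return int(match.group(1)) if match else None


def with_version(path: str, number: int) -> str:
    """Replace the first version suffix in the path with the given number."""
    return re.sub(VERSION_PATTERN, f"_v{number}", path, count=1)


def strip_version(path: str) -> str:
    """Remove the first version suffix from the path."""
    return re.sub(VERSION_PATTERN, "", path, count=1)


path = "2024/ABAS_kommune_utenhav_p2024_v1.parquet"
print(version_number(path))
print(with_version(path, version_number(path) + 1))
print(strip_version(path))
```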

Branch tree

File tree with hyperlinks. Clicking a path copies it.

print(
    Path("ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data").tree()
)
ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data /
    └──2000 /
        └──SSB_tettsted_flate_p2000.parquet
        └──SSB_tettsted_flate_p2000_v1.parquet
    └──2002 /
        └──SSB_tettsted_flate_p2002.parquet
        └──SSB_tettsted_flate_p2002_v1.parquet
    └──2003 /
        └──SSB_tettsted_flate_p2003.parquet
        └──SSB_tettsted_flate_p2003_v1.parquet
    └──2004 /
        └──SSB_tettsted_flate_p2004.parquet
        └──SSB_tettsted_flate_p2004_v1.parquet
    └──2005 /
        └──SSB_tettsted_flate_p2005.parquet
        └──SSB_tettsted_flate_p2005_v1.parquet
    └──2006 /
        └──SSB_tettsted_flate_p2006.parquet
        └──SSB_tettsted_flate_p2006_v1.parquet
    └──2007 /
        └──SSB_tettsted_flate_p2007.parquet
        └──SSB_tettsted_flate_p2007_v1.parquet
    └──2008 /
        └──SSB_tettsted_flate_p2008.parquet
        └──SSB_tettsted_flate_p2008_v1.parquet
        └──SSB_tettsted_ringbuffer_p2008.parquet
        └──(...)
    └──2009 /
        └──SSB_tettsted_flate_p2009.parquet
        └──SSB_tettsted_flate_p2009_v1.parquet
    └──2010 /
        └──SOL_arealressurs_flate_p2010.parquet
        └──SOL_arealressurs_flate_p2010_v1.parquet
    └──2011 /
        └──SOL_Arstat_flate_p2011.parquet
        └──SOL_Arstat_flate_p2011_v1.parquet
        └──SSB_tettsted_flate_p2011.parquet
        └──(...)
    └──2012 /
        └──ABAS_fylke_flate_p2012_v1.parquet
        └──ABAS_fylke_linje_p2012_v1.parquet
        └──ABAS_grunnkrets_flate_p2012_v1.parquet
        └──(...)
    └──2013 /
        └──ABAS_fylke_flate_p2013_v1.parquet
        └──ABAS_kommune_flate_p2013_v1.parquet
        └──DEK_eiendom_flate_p2013_v1.parquet
        └──(...)
    └──2014 /
        └──DEK_eiendom_flate_p2014_v1.parquet
        └──FKB_anlegg_flate_p2014_v1.parquet
        └──FKB_anlegg_linje_p2014_v1.parquet
        └──(...)
    └──2015 /
        └──ABAS_grunnkrets_flate_p2015_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2015_v1.parquet
        └──ABAS_kommune_flate_p2015_v1.parquet
        └──(...)
    └──2016 /
        └──ABAS_fylke_flate_p2016_v1.parquet
        └──ABAS_grunnkrets_flate_p2016_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2016_v1.parquet
        └──(...)
    └──2017 /
        └──ABAS_fylke_flate_p2017_v1.parquet
        └──ABAS_grunnkrets_flate_p2017_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2017_v1.parquet
        └──(...)
    └──2018 /
        └──ABAS_fylke_flate_p2018_v1.parquet
        └──ABAS_grunnkrets_flate_p2018_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2018_v1.parquet
        └──(...)
    └──2019 /
        └──ABAS_fylke_flate_p2019_v1.parquet
        └──ABAS_grunnkrets_flate_p2019_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2019_v1.parquet
        └──(...)
    └──2020 /
        └──ABAS_fylke_flate_p2020_v1.parquet
        └──ABAS_grunnkrets_flate_p2020_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2020_v1.parquet
        └──(...)
    └──2021 /
        └──ABAS_fylke_flate_p2021_v1.parquet
        └──ABAS_grunnkrets_flate_p2021_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2021_v1.parquet
        └──(...)
    └──2022 /
        └──ABAS_fylke_flate_p2022_v1.parquet
        └──ABAS_grunnkrets_flate_p2022_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2022_v1.parquet
        └──(...)
    └──2023 /
        └──ABAS_KnrGamle_p2023_v1.parquet
        └──ABAS_fylke_flate_p2023_v1.parquet
        └──ABAS_grunnkrets_flate_p2023_v1.parquet
        └──(...)
    └──2024 /
        └──ABAS_fylke_flate_p2024_v1.parquet
        └──ABAS_grunnkrets_flate_p2024_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2024_v1.parquet
        └──(...)
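The layout above, with at most three files listed per folder followed by `(...)`, can be reproduced from a flat list of paths. This is only a sketch of the text rendering for a single folder level: the function name and the `max_files` parameter are assumptions, and the clickable hyperlinks are not reproduced.

```python
import posixpath
from collections import defaultdict


def render_tree(paths, root, max_files=3):
    """Group files by parent folder and truncate long folders with '(...)'."""
    by_folder = defaultdict(list)
    for path in sorted(paths):
        # Only the immediate parent folder is used (one level below root).
        folder = posixpath.basename(posixpath.dirname(path))
        by_folder[folder].append(posixpath.basename(path))

    lines = [f"{root} /"]
    for folder, names in by_folder.items():
        lines.append(f"    └──{folder} /")
        for name in names[:max_files]:
            lines.append(f"        └──{name}")
        if len(names) > max_files:
            lines.append("        └──(...)")
    return "\n".join(lines)


# Hypothetical paths, for illustration only.
example_paths = [
    "root/2000/a.parquet",
    "root/2000/a_v1.parquet",
    "root/2008/a.parquet",
    "root/2008/b.parquet",
    "root/2008/c.parquet",
    "root/2008/d.parquet",
]
tree_text = render_tree(example_paths, "root")
print(tree_text)
```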

ls - get file paths, timestamp and size

With paths that are copied (as with ctrl + c) when you click on the path.

files_in_dir = file.parent.ls()
print(files_in_dir)
timestamp            mb (int)
2024-04-19 11:44:12  11                       .../ABAS_kommune_flate_p2024_v1.parquet
2024-04-19 11:45:47  0                    .../N50_JernbaneStasjon_punkt_p2024.parquet
                     0                 .../N50_JernbaneStasjon_punkt_p2024_v1.parquet
                     0                           .../N50_lufthavn_punkt_p2024.parquet
                     0                        .../N50_lufthavn_punkt_p2024_v1.parquet
                                                         ...                         
2024-08-21 14:47:12  861                              .../SSB_hav_flate_p2024.parquet
2024-08-23 14:59:30  152                      .../SSB_tettsted_flate_p2024_v1.parquet
2024-08-23 14:59:36  152              .../SSB_tettsted_kommune_flate_p2024_v1.parquet
2024-08-23 15:34:21  1122        .../SSB_tettsted_kommune_ringbuffer_p2024_v1.parquet
2024-08-23 17:11:32  740                          .../NVDB_veg_linje_p2024_v1.parquet
Name: path, Length: 127, dtype: object
# subclass of pandas.Series
type(files_in_dir)
daplapath.path.PathSeries
print(files_in_dir.loc[lambda x: x.gb > 10].keep_latest_versions())
timestamp            mb (int)
2024-07-18 00:13:09  17646        .../FKB_arealressurs_flate_p2024_v1.parquet
2024-08-20 14:03:16  19717       .../FKB_gronnstruktur_flate_p2024_v1.parquet
Name: path, dtype: object
# the paths are still Path
type(files_in_dir.iloc[0])
daplapath.path.Path
# select the files
print(folder.ls().files)
timestamp            mb (int)
2024-04-19 11:44:12  11                       .../ABAS_kommune_flate_p2024_v1.parquet
2024-04-19 11:45:47  0                    .../N50_JernbaneStasjon_punkt_p2024.parquet
                     0                 .../N50_JernbaneStasjon_punkt_p2024_v1.parquet
                     0                           .../N50_lufthavn_punkt_p2024.parquet
                     0                        .../N50_lufthavn_punkt_p2024_v1.parquet
                                                         ...                         
2024-08-21 14:47:12  861                              .../SSB_hav_flate_p2024.parquet
2024-08-23 14:59:30  152                      .../SSB_tettsted_flate_p2024_v1.parquet
2024-08-23 14:59:36  152              .../SSB_tettsted_kommune_flate_p2024_v1.parquet
2024-08-23 15:34:21  1122        .../SSB_tettsted_kommune_ringbuffer_p2024_v1.parquet
2024-08-23 17:11:32  740                          .../NVDB_veg_linje_p2024_v1.parquet
Name: path, Length: 127, dtype: object
print(folder.ls().dirs)
Series([], Name: path, dtype: object)
# same as .loc with x.str.contains
print(folder.ls().containing("kommune"))
timestamp            mb (int)
2024-04-19 11:44:12  11                       .../ABAS_kommune_flate_p2024_v1.parquet
2024-05-19 12:31:02  941                       .../ABAS_kommune_utenhav_p2024.parquet
2024-06-24 14:25:14  11                          .../ABAS_kommune_flate_p2024.parquet
2024-08-16 16:15:10  941                    .../ABAS_kommune_utenhav_p2024_v1.parquet
2024-08-23 14:59:36  152              .../SSB_tettsted_kommune_flate_p2024_v1.parquet
2024-08-23 15:34:21  1122        .../SSB_tettsted_kommune_ringbuffer_p2024_v1.parquet
Name: path, dtype: object
print(file.parent.parent.ls(recursive=True).files)
timestamp            mb (int)
2024-04-19 11:43:21  0                 .../2022/N50_JernbaneStasjon_punkt_p2022_v1.parquet
2024-04-19 11:43:22  0                        .../2022/N50_lufthavn_punkt_p2022_v1.parquet
2024-04-19 11:43:23  0                      .../2022/NVE_Vindturbin_punkt_p2022_v1.parquet
                     0                    .../2022/NVE_Trafostasjon_punkt_p2022_v1.parquet
2024-04-19 11:43:24  0                     .../2022/S100_TekniskSit_flate_p2022_v1.parquet
                                                           ...                            
2024-08-21 14:47:12  861                              .../2024/SSB_hav_flate_p2024.parquet
2024-08-23 14:59:30  152                      .../2024/SSB_tettsted_flate_p2024_v1.parquet
2024-08-23 14:59:36  152              .../2024/SSB_tettsted_kommune_flate_p2024_v1.parquet
2024-08-23 15:34:21  1122        .../2024/SSB_tettsted_kommune_ringbuffer_p2024_v1.parquet
2024-08-23 17:11:32  740                          .../2024/NVDB_veg_linje_p2024_v1.parquet
Length: 1323, dtype: object
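As a rough illustration of what `keep_latest_versions()` might do, the sketch below groups paths by their name with the version suffix removed and keeps the most recently updated path in each group. That "latest" means newest timestamp rather than highest version number is an assumption, as is the function signature; the timestamps and truncated paths are taken from the listings above.

```python
import re
from datetime import datetime

VERSION_PATTERN = r"_v(\d+)"


def keep_latest_versions(entries):
    """Keep the most recently updated path per file, ignoring version suffixes.

    `entries` is an iterable of (timestamp, path) tuples.
    """
    latest = {}
    for timestamp, path in entries:
        # Paths that differ only in version suffix share the same key.
        key = re.sub(VERSION_PATTERN, "", path, count=1)
        if key not in latest or timestamp > latest[key][0]:
            latest[key] = (timestamp, path)
    return [path for _, path in latest.values()]


entries = [
    (datetime(2024, 5, 19, 12, 31, 2), ".../ABAS_kommune_utenhav_p2024.parquet"),
    (datetime(2024, 8, 16, 16, 15, 10), ".../ABAS_kommune_utenhav_p2024_v1.parquet"),
    (datetime(2024, 8, 23, 14, 59, 30), ".../SSB_tettsted_flate_p2024_v1.parquet"),
]
latest_paths = keep_latest_versions(entries)
print(latest_paths)
```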

Write to testpath

testpath = Path('ssb-areal-data-produkt-prod/arealstat/temp/test_df_p2023_v1.parquet')

# delete files first
for version in testpath.versions():
    version.rm_file()

testpath.exists()
False
df = pd.DataFrame({"x": [1,2,3], "y": [*"abc"]})

dp.write_pandas(df, testpath)

testpath.exists()
True
testpath.latest_version()
'ssb-areal-data-produkt-prod/arealstat/temp/test_df_p2023_v1.parquet'
# highest_numbered_version + 1
testpath.new_version()
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

Cell In[31], line 2
      1 # highest_numbered_version + 1
----> 2 testpath.new_version()


File ~/daplapath/daplapath/path.py:805, in Path.new_version(self, timeout)
    803     time_should_be_at_least = pd.Timestamp.now() - pd.Timedelta(minutes=timeout)
    804     if timestamp[0] > time_should_be_at_least:
--> 805         raise ValueError(
    806             f"Latest version of the file was updated {timestamp[0]}, which "
    807             f"is less than the timeout period of {timeout} minutes. "
    808             "Change the timeout argument, but be sure to not save new "
    809             "versions in a loop."
    810         )
    812 return highest_numbered.add_to_version_number(1)


ValueError: Latest version of the file was updated 2024-08-28 15:09:47, which is less than the timeout period of 30 minutes. Change the timeout argument, but be sure to not save new versions in a loop.
dp.write_pandas(df, testpath.new_version(timeout=0.01))
print(testpath.versions())
timestamp            mb (int)
2024-08-28 15:09:47  0           ssb-areal-data-produkt-prod/arealstat/temp/test_df_p2023_v1.parquet
2024-08-28 15:09:52  0           ssb-areal-data-produkt-prod/arealstat/temp/test_df_p2023_v2.parquet
dtype: object
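The traceback above shows the guard behind the `timeout` argument: a new version is refused if the latest one was written less than `timeout` minutes ago. The sketch below restates that check with the standard library only; the function name and the explicit `now` parameter are assumptions made so the example is deterministic.

```python
from datetime import datetime, timedelta


def check_timeout(last_updated: datetime, timeout_minutes: float, now: datetime) -> None:
    """Raise ValueError if the latest version is newer than the timeout window."""
    time_should_be_at_least = now - timedelta(minutes=timeout_minutes)
    if last_updated > time_should_be_at_least:
        raise ValueError(
            f"Latest version of the file was updated {last_updated}, which "
            f"is less than the timeout period of {timeout_minutes} minutes."
        )


# Timestamps taken from the two versions written above, five seconds apart.
last_updated = datetime(2024, 8, 28, 15, 9, 47)
now = datetime(2024, 8, 28, 15, 9, 52)

try:
    check_timeout(last_updated, 30, now)  # default-style timeout: refused
except ValueError as error:
    print(error)

check_timeout(last_updated, 0.01, now)  # 0.6 seconds: allowed
```

With `timeout=0.01` the window is 0.6 seconds, so a write five seconds after the previous one passes, which is consistent with the second version appearing in the listing above.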
