
A pathlib.Path class for dapla

dapla-path

pathlib.Path for dapla

Created by: ort ort@ssb.no


Path (dapla)

import dapla as dp
import pandas as pd

from daplapath.path import Path
folder = Path('ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024')
folder
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024'

Works like a string

folder.startswith("ssb")
True
dp.FileClient.get_gcs_file_system().exists(folder)
True

With methods and attributes à la pathlib.Path

folder.exists()
True
folder.is_dir()
True
file = folder / "ABAS_kommune_utenhav_p2024_v1.parquet"
file
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024/ABAS_kommune_utenhav_p2024_v1.parquet'
file.parent
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024'
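
The behaviour above (string methods plus pathlib-style `/` and `.parent`) can be illustrated with a plain `str` subclass. This is only a minimal sketch, not daplapath's actual implementation: the class name `StrPath` and the bucket and file names are hypothetical.

```python
import posixpath


class StrPath(str):
    """Hypothetical str subclass with a few pathlib-like helpers."""

    def __truediv__(self, other: str) -> "StrPath":
        # Join with "/" so that `folder / "file.parquet"` works as in pathlib.
        return type(self)(posixpath.join(self, str(other)))

    @property
    def parent(self) -> "StrPath":
        # Everything before the last "/".
        return type(self)(posixpath.dirname(self))

    @property
    def name(self) -> str:
        # The final path component.
        return posixpath.basename(self)


folder = StrPath("bucket/analyse_data/2024")
file = folder / "kommune_p2024_v1.parquet"

# Still a string, so str methods and APIs that expect str keep working.
print(file.startswith("bucket"), file.parent == folder)
```

Because the object is a real `str`, it can be passed directly to functions that expect a text path, such as the `exists` call on the file system object above.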

And some pandas attributes

Without reading the file

file.columns
Index(['OBJTYPE', 'NAVN', 'komm_nr', 'fylke_nr', 'AREAL_GDB', 'SHAPE_Length',
       'SHAPE_Area', 'geometry'],
      dtype='object')
file.dtypes
OBJTYPE         string
NAVN            string
komm_nr         string
fylke_nr        string
AREAL_GDB       double
SHAPE_Length    double
SHAPE_Area      double
geometry        binary
dtype: object
file.shape
(481, 8)

Versioning

file.version_number
1
print(file.versions())
timestamp            mb (int)
2024-05-19 12:31:02  941            .../ABAS_kommune_utenhav_p2024.parquet
2024-08-16 16:15:10  941         .../ABAS_kommune_utenhav_p2024_v1.parquet
Name: path, dtype: object
file.latest_version()
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024/ABAS_kommune_utenhav_p2024_v1.parquet'
file.highest_numbered_version()
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024/ABAS_kommune_utenhav_p2024_v1.parquet'
# highest_numbered_version + 1
file.new_version()
'ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data/2024/ABAS_kommune_utenhav_p2024_v2.parquet'
# always False
file.new_version().exists()
False
# finds/removes the version number with a regex search
file._version_pattern
'_v(\\d+)'
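
As a rough illustration of how a pattern like `_v(\d+)` can be used to find, remove and replace a version number, here is a small sketch on plain strings. The helper names (`version_number`, `strip_version`, `with_version`) are hypothetical and not part of daplapath.

```python
import posixpath
import re
from typing import Optional

VERSION_PATTERN = r"_v(\d+)"  # same pattern as shown above


def version_number(path: str) -> Optional[int]:
    """Return the last version number found in the path, or None."""
    matches = re.findall(VERSION_PATTERN, path)
    return int(matches[-1]) if matches else None


def strip_version(path: str) -> str:
    """Remove the version suffix, e.g. 'x_v1.parquet' -> 'x.parquet'."""
    return re.sub(VERSION_PATTERN, "", path)


def with_version(path: str, number: int) -> str:
    """Insert '_v<number>' before the file extension."""
    stem, suffix = posixpath.splitext(strip_version(path))
    return f"{stem}_v{number}{suffix}"


path = "bucket/2024/ABAS_kommune_utenhav_p2024_v1.parquet"
next_path = with_version(path, version_number(path) + 1)
print(next_path)
```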

Branch tree

File tree with hyperlinks. Clicking a path copies it.

print(
    Path("ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data").tree()
)
ssb-kart-data-delt-geo-prod/analyse_data/klargjorte-data /
    └──2000 /
        └──SSB_tettsted_flate_p2000.parquet
        └──SSB_tettsted_flate_p2000_v1.parquet
    └──2002 /
        └──SSB_tettsted_flate_p2002.parquet
        └──SSB_tettsted_flate_p2002_v1.parquet
    └──2003 /
        └──SSB_tettsted_flate_p2003.parquet
        └──SSB_tettsted_flate_p2003_v1.parquet
    └──2004 /
        └──SSB_tettsted_flate_p2004.parquet
        └──SSB_tettsted_flate_p2004_v1.parquet
    └──2005 /
        └──SSB_tettsted_flate_p2005.parquet
        └──SSB_tettsted_flate_p2005_v1.parquet
    └──2006 /
        └──SSB_tettsted_flate_p2006.parquet
        └──SSB_tettsted_flate_p2006_v1.parquet
    └──2007 /
        └──SSB_tettsted_flate_p2007.parquet
        └──SSB_tettsted_flate_p2007_v1.parquet
    └──2008 /
        └──SSB_tettsted_flate_p2008.parquet
        └──SSB_tettsted_flate_p2008_v1.parquet
        └──SSB_tettsted_ringbuffer_p2008.parquet
        └──(...)
    └──2009 /
        └──SSB_tettsted_flate_p2009.parquet
        └──SSB_tettsted_flate_p2009_v1.parquet
    └──2010 /
        └──SOL_arealressurs_flate_p2010.parquet
        └──SOL_arealressurs_flate_p2010_v1.parquet
    └──2011 /
        └──SOL_Arstat_flate_p2011.parquet
        └──SOL_Arstat_flate_p2011_v1.parquet
        └──SSB_tettsted_flate_p2011.parquet
        └──(...)
    └──2012 /
        └──ABAS_fylke_flate_p2012_v1.parquet
        └──ABAS_fylke_linje_p2012_v1.parquet
        └──ABAS_grunnkrets_flate_p2012_v1.parquet
        └──(...)
    └──2013 /
        └──ABAS_fylke_flate_p2013_v1.parquet
        └──ABAS_kommune_flate_p2013_v1.parquet
        └──DEK_eiendom_flate_p2013_v1.parquet
        └──(...)
    └──2014 /
        └──DEK_eiendom_flate_p2014_v1.parquet
        └──FKB_anlegg_flate_p2014_v1.parquet
        └──FKB_anlegg_linje_p2014_v1.parquet
        └──(...)
    └──2015 /
        └──ABAS_grunnkrets_flate_p2015_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2015_v1.parquet
        └──ABAS_kommune_flate_p2015_v1.parquet
        └──(...)
    └──2016 /
        └──ABAS_fylke_flate_p2016_v1.parquet
        └──ABAS_grunnkrets_flate_p2016_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2016_v1.parquet
        └──(...)
    └──2017 /
        └──ABAS_fylke_flate_p2017_v1.parquet
        └──ABAS_grunnkrets_flate_p2017_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2017_v1.parquet
        └──(...)
    └──2018 /
        └──ABAS_fylke_flate_p2018_v1.parquet
        └──ABAS_grunnkrets_flate_p2018_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2018_v1.parquet
        └──(...)
    └──2019 /
        └──ABAS_fylke_flate_p2019_v1.parquet
        └──ABAS_grunnkrets_flate_p2019_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2019_v1.parquet
        └──(...)
    └──2020 /
        └──ABAS_fylke_flate_p2020_v1.parquet
        └──ABAS_grunnkrets_flate_p2020_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2020_v1.parquet
        └──(...)
    └──2021 /
        └──ABAS_fylke_flate_p2021_v1.parquet
        └──ABAS_grunnkrets_flate_p2021_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2021_v1.parquet
        └──(...)
    └──2022 /
        └──ABAS_fylke_flate_p2022_v1.parquet
        └──ABAS_grunnkrets_flate_p2022_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2022_v1.parquet
        └──(...)
    └──2023 /
        └──ABAS_KnrGamle_p2023_v1.parquet
        └──ABAS_fylke_flate_p2023_v1.parquet
        └──ABAS_grunnkrets_flate_p2023_v1.parquet
        └──(...)
    └──2024 /
        └──ABAS_fylke_flate_p2024_v1.parquet
        └──ABAS_grunnkrets_flate_p2024_v1.parquet
        └──ABAS_grunnkrets_utenhav_p2024_v1.parquet
        └──(...)
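
The layout of the tree above (directories first, at most three files per directory, then `(...)`) can be reproduced on a local directory with the standard library. This is only a sketch of the text layout under those assumptions: the function `tree` and its `max_files` parameter are hypothetical here, and the clickable hyperlinks are not covered.

```python
import os
import tempfile


def tree(root: str, max_files: int = 3, indent: str = "    ") -> str:
    """Render sub-directories and at most `max_files` files per directory."""
    lines = [f"{root} /"]

    def walk(directory: str, depth: int) -> None:
        entries = sorted(os.scandir(directory), key=lambda e: e.name)
        dirs = [e for e in entries if e.is_dir()]
        files = [e for e in entries if e.is_file()]
        prefix = indent * depth + "└──"
        for d in dirs:
            lines.append(f"{prefix}{d.name} /")
            walk(d.path, depth + 1)
        for f in files[:max_files]:
            lines.append(f"{prefix}{f.name}")
        if len(files) > max_files:
            # Signal that some files were left out.
            lines.append(f"{prefix}(...)")

    walk(root, 1)
    return "\n".join(lines)


# Build a small temporary directory to render.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "2024"))
for name in ["a.parquet", "b.parquet", "c.parquet", "d.parquet"]:
    open(os.path.join(root, "2024", name), "w").close()

rendered = tree(root)
print(rendered)
```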

ls - get file paths, timestamp and size

With paths that are copied (like ctrl + c) when you click on them.

files_in_dir = file.parent.ls()
print(files_in_dir)
timestamp            mb (int)
2024-04-19 11:44:12  11                       .../ABAS_kommune_flate_p2024_v1.parquet
2024-04-19 11:45:47  0                    .../N50_JernbaneStasjon_punkt_p2024.parquet
                     0                 .../N50_JernbaneStasjon_punkt_p2024_v1.parquet
                     0                           .../N50_lufthavn_punkt_p2024.parquet
                     0                        .../N50_lufthavn_punkt_p2024_v1.parquet
                                                         ...                         
2024-08-21 14:47:12  861                              .../SSB_hav_flate_p2024.parquet
2024-08-23 14:59:30  152                      .../SSB_tettsted_flate_p2024_v1.parquet
2024-08-23 14:59:36  152              .../SSB_tettsted_kommune_flate_p2024_v1.parquet
2024-08-23 15:34:21  1122        .../SSB_tettsted_kommune_ringbuffer_p2024_v1.parquet
2024-08-23 17:11:32  740                          .../NVDB_veg_linje_p2024_v1.parquet
Name: path, Length: 127, dtype: object
# subclass of pandas.Series
type(files_in_dir)
daplapath.path.PathSeries
print(files_in_dir.loc[lambda x: x.gb > 10].keep_latest_versions())
timestamp            mb (int)
2024-07-18 00:13:09  17646        .../FKB_arealressurs_flate_p2024_v1.parquet
2024-08-20 14:03:16  19717       .../FKB_gronnstruktur_flate_p2024_v1.parquet
Name: path, dtype: object
# the paths are still Path
type(files_in_dir.iloc[0])
daplapath.path.Path
# select the files
print(folder.ls().files)
timestamp            mb (int)
2024-04-19 11:44:12  11                       .../ABAS_kommune_flate_p2024_v1.parquet
2024-04-19 11:45:47  0                    .../N50_JernbaneStasjon_punkt_p2024.parquet
                     0                 .../N50_JernbaneStasjon_punkt_p2024_v1.parquet
                     0                           .../N50_lufthavn_punkt_p2024.parquet
                     0                        .../N50_lufthavn_punkt_p2024_v1.parquet
                                                         ...                         
2024-08-21 14:47:12  861                              .../SSB_hav_flate_p2024.parquet
2024-08-23 14:59:30  152                      .../SSB_tettsted_flate_p2024_v1.parquet
2024-08-23 14:59:36  152              .../SSB_tettsted_kommune_flate_p2024_v1.parquet
2024-08-23 15:34:21  1122        .../SSB_tettsted_kommune_ringbuffer_p2024_v1.parquet
2024-08-23 17:11:32  740                          .../NVDB_veg_linje_p2024_v1.parquet
Name: path, Length: 127, dtype: object
print(folder.ls().dirs)
Series([], Name: path, dtype: object)
# same as .loc with x.str.contains
print(folder.ls().containing("kommune"))
timestamp            mb (int)
2024-04-19 11:44:12  11                       .../ABAS_kommune_flate_p2024_v1.parquet
2024-05-19 12:31:02  941                       .../ABAS_kommune_utenhav_p2024.parquet
2024-06-24 14:25:14  11                          .../ABAS_kommune_flate_p2024.parquet
2024-08-16 16:15:10  941                    .../ABAS_kommune_utenhav_p2024_v1.parquet
2024-08-23 14:59:36  152              .../SSB_tettsted_kommune_flate_p2024_v1.parquet
2024-08-23 15:34:21  1122        .../SSB_tettsted_kommune_ringbuffer_p2024_v1.parquet
Name: path, dtype: object
print(file.parent.parent.ls(recursive=True).files)
timestamp            mb (int)
2024-04-19 11:43:21  0                 .../2022/N50_JernbaneStasjon_punkt_p2022_v1.parquet
2024-04-19 11:43:22  0                        .../2022/N50_lufthavn_punkt_p2022_v1.parquet
2024-04-19 11:43:23  0                      .../2022/NVE_Vindturbin_punkt_p2022_v1.parquet
                     0                    .../2022/NVE_Trafostasjon_punkt_p2022_v1.parquet
2024-04-19 11:43:24  0                     .../2022/S100_TekniskSit_flate_p2022_v1.parquet
                                                           ...                            
2024-08-21 14:47:12  861                              .../2024/SSB_hav_flate_p2024.parquet
2024-08-23 14:59:30  152                      .../2024/SSB_tettsted_flate_p2024_v1.parquet
2024-08-23 14:59:36  152              .../2024/SSB_tettsted_kommune_flate_p2024_v1.parquet
2024-08-23 15:34:21  1122        .../2024/SSB_tettsted_kommune_ringbuffer_p2024_v1.parquet
2024-08-23 17:11:32  740                          .../2024/NVDB_veg_linje_p2024_v1.parquet
Length: 1323, dtype: object
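
To make the idea of filtering a listing down to one version per file more concrete, here is a sketch on a plain list of strings. It keeps the highest version number for each file, and it assumes that an unversioned file counts as version 0. Whether `keep_latest_versions` uses version numbers or timestamps is not stated here, so the helper name `keep_highest_versions` and its logic are illustrative only.

```python
import re

VERSION_PATTERN = r"_v(\d+)"


def keep_highest_versions(paths: list[str]) -> list[str]:
    """Keep one path per file: the one with the highest version number."""
    highest: dict[str, tuple[int, str]] = {}
    for path in paths:
        match = re.search(VERSION_PATTERN, path)
        number = int(match.group(1)) if match else 0  # assumption: no suffix = 0
        key = re.sub(VERSION_PATTERN, "", path)  # path without version suffix
        if key not in highest or number > highest[key][0]:
            highest[key] = (number, path)
    return [path for _, path in highest.values()]


paths = [
    "d/a_p2024.parquet",
    "d/a_p2024_v1.parquet",
    "d/a_p2024_v2.parquet",
    "d/b_p2024_v1.parquet",
]
result = keep_highest_versions(paths)
print(result)
```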

Write to testpath

testpath = Path('ssb-areal-data-produkt-prod/arealstat/temp/test_df_p2023_v1.parquet')

# delete files first
for version in testpath.versions():
    version.rm_file()

testpath.exists()
False
df = pd.DataFrame({"x": [1,2,3], "y": [*"abc"]})

dp.write_pandas(df, testpath)

testpath.exists()
True
testpath.latest_version()
'ssb-areal-data-produkt-prod/arealstat/temp/test_df_p2023_v1.parquet'
# highest_numbered_version + 1
testpath.new_version()
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

Cell In[31], line 2
      1 # highest_numbered_version + 1
----> 2 testpath.new_version()


File ~/daplapath/daplapath/path.py:805, in Path.new_version(self, timeout)
    803     time_should_be_at_least = pd.Timestamp.now() - pd.Timedelta(minutes=timeout)
    804     if timestamp[0] > time_should_be_at_least:
--> 805         raise ValueError(
    806             f"Latest version of the file was updated {timestamp[0]}, which "
    807             f"is less than the timeout period of {timeout} minutes. "
    808             "Change the timeout argument, but be sure to not save new "
    809             "versions in a loop."
    810         )
    812 return highest_numbered.add_to_version_number(1)


ValueError: Latest version of the file was updated 2024-08-28 15:09:47, which is less than the timeout period of 30 minutes. Change the timeout argument, but be sure to not save new versions in a loop.
dp.write_pandas(df, testpath.new_version(timeout=0.01))
print(testpath.versions())
timestamp            mb (int)
2024-08-28 15:09:47  0           ssb-areal-data-produkt-prod/arealstat/temp/test_df_p2023_v1.parquet
2024-08-28 15:09:52  0           ssb-areal-data-produkt-prod/arealstat/temp/test_df_p2023_v2.parquet
dtype: object
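
The timeout check seen in the traceback above can be sketched with the standard library: a new version is refused if the latest version was modified more recently than `timeout` minutes ago. The function `check_timeout` and its `now` parameter are hypothetical and only mirror the logic visible in the traceback.

```python
from datetime import datetime, timedelta
from typing import Optional


def check_timeout(
    last_modified: datetime,
    timeout_minutes: float = 30,
    now: Optional[datetime] = None,
) -> None:
    """Raise ValueError if the file was modified within the timeout period."""
    now = now or datetime.now()
    earliest_allowed = now - timedelta(minutes=timeout_minutes)
    if last_modified > earliest_allowed:
        raise ValueError(
            f"Latest version of the file was updated {last_modified}, which "
            f"is less than the timeout period of {timeout_minutes} minutes."
        )


# Fixed timestamps so the example is deterministic.
now = datetime(2024, 8, 28, 15, 10, 0)
last_modified = datetime(2024, 8, 28, 15, 9, 47)

try:
    check_timeout(last_modified, timeout_minutes=30, now=now)
    blocked = False
except ValueError:
    blocked = True

# A very short timeout lets the new version through (13 s > 0.6 s).
check_timeout(last_modified, timeout_minutes=0.01, now=now)
print(blocked)
```

The guard exists to avoid creating many versions by accident, for example when a write call sits inside a loop.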
