Skip to main content

Abstraction layer for Living Standards Measurement Survey data

Project description

* Streaming dvc files
A =dvc pull= will download dvc files to your local repository.
But this may not be the best way to proceed! In particular, =dvc=
offers an api which permits one to "stream" or cache files, leaving
your storage local to the working repository free of big data
files.

To illustrate,
#+begin_src python
import dvc.api
import pandas as pd

with dvc.api.open('BigRemoteFile.dta',mode='rb') as dta:
df = pd.read_stata(dta)
#+end_src
This will result in a =pandas.DataFrame= in RAM, but will use no
additional disk (except that, depending on what's being used as the
dvc store, the file may actually be stored in =.dvc/cache=; this
cache can be cleared with =dvc gc=).

** Pulling dvc files
If you need the actual file instead of a "stream" you can instead
"pull" the dvc files, using
#+begin_src sh
dvc pull
#+end_src
and files should be added from the remote dvc data store to your
working repository.

* Adding New Data
** Additional S3 Credentials
Write access to the remote s3 repository requires additional credentials; contact =ligon@berkeley.edu= to obtain these.

** Procedure to Add Data
To add a new LSMS-style survey to the repo, you'll follow the
following steps. Here we give the example of adding a 2015--16
survey from Uganda, obtained from
https://microdata.worldbank.org/index.php/catalog/3460. The same
steps should work for you /mutatis mutandis/:

1. Create a directory corresponding to the country or area; e.g.,
#+begin_src sh
mkdir Uganda
#+end_src
2. Create a /sub/-directory indicating the time period for the
survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16
#+end_src
3. Create a =Documentation= sub-directory for each survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16/Documentation
#+end_src
In this directory include the following files:
- SOURCE :: A text file giving both a url (if available) and
citation information for the dataset.
- LICENSE :: A text file containing a description of the license
or other terms under which you've obtained the data.
4. Add other documentation useful for understanding the data to the
=Documentation= sub-directory.

5. Add all the contents of the =Documentation= folder to the =git= repo;
e.g.,
#+begin_src sh
cd ./Uganda/2015-16/Documentation
git add .
git commit -m"Add Uganda 2015-16 documentation to repo."
git push
#+end_src

6. Create a =Data= sub-directory for each survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16/Data
#+end_src

7. Obtain a copy of the data you're interested in, perhaps as a zip
file or other archive. Store this in some temporary place, and
unzip (or whatever) the files into the relevant Country/Year/Data
directory, taking care to preserve any useful directory structure
in the archive. E.g.,
#+begin_src sh
cd Uganda/2015-16 && unzip -j /tmp/UGA_2015_UNPS_v01_M_STATA8.zip
#+end_src
8. Add the data you've unarchived to =dvc=, then add the /pointers/
(i.e., files with a .dvc extension to git). For the Uganda case we assume that
all the relevant data comes in the form of =stata= *.dta files,
since this is what we downloaded from the World Bank. For example,
#+begin_src sh
cd ../Data
dvc add *.dta
git commit -m"Add Uganda/2015-16/Data/*.dta files to dvc store."
git pull && git push
#+end_src
9. Push the data files to the dvc store. Make sure you have good
internet connection! Then a simple
#+begin_src sh
dvc push
#+end_src
will copy the data to the remote data store. NB: If this is the
first time you've done this for this repository, then you'll
first need to jump through some simple hoops to authenticate with
gdrive.
10. With the files pushed to the dvc store, you won't need them
locally anymore, so you can do something like
#+begin_src sh
cd ../Data && rm *.dta
#+end_src
or (if you have a more complex directory structure) perhaps
#+begin_src sh
find ../Data -name \*.dta -exec rm \{\} \;
#+end_src

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lsms_library-0.2.3.tar.gz (18.7 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lsms_library-0.2.3-py3-none-any.whl (21.7 MB view details)

Uploaded Python 3

File details

Details for the file lsms_library-0.2.3.tar.gz.

File metadata

  • Download URL: lsms_library-0.2.3.tar.gz
  • Upload date:
  • Size: 18.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.11.5 Linux/6.6.76-08096-g300882a0a131

File hashes

Hashes for lsms_library-0.2.3.tar.gz
Algorithm Hash digest
SHA256 7cfedffd6c59228df33e8e7317f2e35386d9eca9cae674eeda8b56147d9d56a0
MD5 24d45541e88035ad62dd95d48abfcc69
BLAKE2b-256 55b8a1cbd4da2dca553d0d4c376732ac3a757e253141465bad1c4601796429c2

See more details on using hashes here.

File details

Details for the file lsms_library-0.2.3-py3-none-any.whl.

File metadata

  • Download URL: lsms_library-0.2.3-py3-none-any.whl
  • Upload date:
  • Size: 21.7 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.11.5 Linux/6.6.76-08096-g300882a0a131

File hashes

Hashes for lsms_library-0.2.3-py3-none-any.whl
Algorithm Hash digest
SHA256 ac3ec333b2a3f6fbe69f6ce49a5306258b1daa411a5c55a75fed6b694c809bfa
MD5 0b839b2cd8599621ffb0fbf0c810f614
BLAKE2b-256 df875a54ade12437667cac1c0c0fbbc5780591a814ce62437284ec3e41e1c91b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page