Abstraction layer for Living Standards Measurement Survey data


* Streaming dvc files
A =dvc pull= will download dvc files into your local repository.
But this may not be the best way to proceed! In particular, =dvc=
offers an API which permits one to "stream" or cache files, keeping
the local storage of your working repository free of big data
files.

To illustrate,
#+begin_src python
import dvc.api
import pandas as pd

with dvc.api.open('BigRemoteFile.dta', mode='rb') as dta:
    df = pd.read_stata(dta)
#+end_src
This will result in a =pandas.DataFrame= in RAM, using no
additional disk space (except that, depending on what's being used as
the dvc store, the file may be cached in =.dvc/cache=; that cache can
be cleared with =dvc gc=).
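
Streaming also composes with =pandas='s own chunked reading, so even a
very large =.dta= file need not fit in RAM all at once. A minimal
sketch, independent of =dvc= (the =stata_row_count= helper is our own
invention for illustration; any binary file object, such as the handle
returned by =dvc.api.open= above, should work the same way provided it
is seekable):
#+begin_src python
import pandas as pd

def stata_row_count(fileobj, chunksize=10_000):
    """Count the rows of a Stata file without loading it all into RAM."""
    total = 0
    # With chunksize set, read_stata returns an iterator of DataFrames.
    reader = pd.read_stata(fileobj, chunksize=chunksize)
    for chunk in reader:
        total += len(chunk)
    return total
#+end_src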

* Pulling dvc files
If you need the actual file rather than a "stream", you can
"pull" the dvc files, using
#+begin_src sh
dvc pull
#+end_src
which copies files from the remote dvc data store into your
working repository.

* Adding New Data
** Additional S3 Credentials
Write access to the remote s3 repository requires additional credentials; contact =ligon@berkeley.edu= to obtain these.

** Procedure to Add Data
To add a new LSMS-style survey to the repo, follow the steps
below. Here we give the example of adding a 2015--16
survey from Uganda, obtained from
https://microdata.worldbank.org/index.php/catalog/3460. The same
steps should work for you /mutatis mutandis/:

1. Create a directory corresponding to the country or area; e.g.,
#+begin_src sh
mkdir Uganda
#+end_src
2. Create a /sub/-directory indicating the time period for the
survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16
#+end_src
3. Create a =Documentation= sub-directory for each survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16/Documentation
#+end_src
In this directory include the following files:
- SOURCE :: A text file giving both a URL (if available) and
citation information for the dataset.
- LICENSE :: A text file containing a description of the license
or other terms under which you've obtained the data.
4. Add other documentation useful for understanding the data to the
=Documentation= sub-directory.

5. Add all the contents of the =Documentation= folder to the =git= repo;
e.g.,
#+begin_src sh
cd ./Uganda/2015-16/Documentation
git add .
git commit -m "Add Uganda 2015-16 documentation to repo."
git push
#+end_src

6. Create a =Data= sub-directory for each survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16/Data
#+end_src

7. Obtain a copy of the data you're interested in, perhaps as a zip
file or other archive. Store this in some temporary place, and
unzip (or whatever) the files into the relevant Country/Year/Data
directory, taking care to preserve any useful directory structure
in the archive. E.g.,
#+begin_src sh
cd Uganda/2015-16/Data && unzip -j /tmp/UGA_2015_UNPS_v01_M_STATA8.zip
#+end_src
(Here the =-j= flag discards the directory structure inside the
archive; omit it if that structure is worth preserving.)
8. Add the data you've unarchived to =dvc=, then add the /pointers/
(i.e., files with a =.dvc= extension) to =git=. For the Uganda case we
assume that all the relevant data come in the form of =stata= *.dta
files, since this is what we downloaded from the World Bank. For example,
#+begin_src sh
dvc add *.dta
git add *.dta.dvc .gitignore
git commit -m "Add Uganda/2015-16/Data/*.dta files to dvc store."
git pull && git push
#+end_src
9. Push the data files to the dvc store. Make sure you have a good
internet connection! Then a simple
#+begin_src sh
dvc push
#+end_src
will copy the data to the remote data store. NB: If this is the
first time you've done this for this repository, then you'll
first need to jump through some simple hoops to authenticate with
gdrive.
10. With the files pushed to the dvc store, you won't need local
copies anymore, so (from the survey's =Data= directory) you can do
something like
#+begin_src sh
rm *.dta
#+end_src
or (if you have a more complex directory structure) perhaps
#+begin_src sh
find . -name '*.dta' -exec rm {} \;
#+end_src
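
The directory scaffolding in steps 1--6 is mechanical enough to
script. A minimal sketch using Python's =pathlib= (the =scaffold=
helper and its arguments are our own invention for illustration, not
part of this library):
#+begin_src python
from pathlib import Path

def scaffold(root, country, wave):
    """Create Country/Wave/{Documentation,Data} directories for a new
    survey, with empty SOURCE and LICENSE stubs to fill in by hand."""
    base = Path(root) / country / wave
    for sub in ("Documentation", "Data"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    for name in ("SOURCE", "LICENSE"):
        (base / "Documentation" / name).touch(exist_ok=True)
    return base

# e.g., scaffold('.', 'Uganda', '2015-16')
#+end_src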
