
Abstraction layer for Living Standards Measurement Survey data


* Streaming dvc files
A =dvc pull= downloads dvc-tracked files into your local repository,
but this is not always the best way to proceed. In particular, =dvc=
offers an API that lets one "stream" or cache files, keeping the
working repository's local storage free of big data
files.

To illustrate,
#+begin_src python
import dvc.api
import pandas as pd

# Stream the file from the dvc store rather than pulling it to disk.
with dvc.api.open('BigRemoteFile.dta', mode='rb') as dta:
    df = pd.read_stata(dta)
#+end_src
This yields a =pandas.DataFrame= in RAM and uses no additional disk
space, with one exception: depending on what is being used as the
dvc store, the file may actually be stored in =.dvc/cache=. This
cache can be cleared with =dvc gc= (recent versions of =dvc= require
a scope flag such as =--workspace=).
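
When several files need to be streamed, the pattern above can be
wrapped in a small helper. The following is only a sketch:
=read_streamed= and its =opener= parameter are hypothetical names
introduced here for illustration, not part of =dvc= or of this
library. The only =dvc= call assumed is =dvc.api.open=, which also
accepts =repo= and =rev= keyword arguments for reading from another
repository or revision.
#+begin_src python
def read_streamed(path, reader, opener=None, **open_kwargs):
    """Open `path` in binary mode with `opener` and return `reader(stream)`.

    `opener` defaults to dvc.api.open; it is a parameter only so that the
    helper can be tried out without access to a dvc remote.
    """
    if opener is None:
        import dvc.api  # imported lazily; only needed for real dvc streams
        opener = dvc.api.open
    # Extra keyword arguments (e.g. repo=..., rev=...) are passed through.
    with opener(path, mode='rb', **open_kwargs) as stream:
        return reader(stream)

# Usage (assumes pandas is installed and the file is tracked by dvc):
#   import pandas as pd
#   df = read_streamed('BigRemoteFile.dta', pd.read_stata)
#+end_src
Passing the reader as an argument keeps the helper independent of the
file format, so the same function would serve for formats other than
Stata.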

** Pulling dvc files
If you need the actual file rather than a "stream", you can
"pull" the dvc files, using
#+begin_src sh
dvc pull
#+end_src
and the files will be copied from the remote dvc data store to your
working repository.
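
To see which dvc-tracked files are still absent from the working
tree (before or after a pull), one can compare the =.dvc= pointer
files against the files they stand for. A minimal sketch using only
the standard library follows; =missing_dvc_outputs= is a hypothetical
helper, and it assumes the usual convention that the pointer
=foo.dta.dvc= sits next to the file =foo.dta= it tracks.
#+begin_src python
from pathlib import Path


def missing_dvc_outputs(root='.'):
    """Return sorted paths that have a .dvc pointer but no data file."""
    missing = []
    for pointer in Path(root).rglob('*.dvc'):
        if not pointer.is_file():
            continue  # e.g. the .dvc/ directory at the repository root
        target = pointer.with_suffix('')  # 'foo.dta.dvc' -> 'foo.dta'
        if not target.exists():
            missing.append(target)
    return sorted(missing)
#+end_src
An empty list after =dvc pull= suggests that every pointer under
=root= has its data file in place; the sketch does not verify file
contents or hashes.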

* Adding New Data
** Additional S3 Credentials
Write access to the remote s3 repository requires additional credentials; contact =ligon@berkeley.edu= to obtain these.

** Procedure to Add Data
To add a new LSMS-style survey to the repo, follow the
steps below. Here we give the example of adding a 2015--16
survey from Uganda, obtained from
https://microdata.worldbank.org/index.php/catalog/3460. The same
steps should work for you /mutatis mutandis/:

1. Create a directory corresponding to the country or area; e.g.,
#+begin_src sh
mkdir Uganda
#+end_src
2. Create a /sub/-directory indicating the time period for the
survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16
#+end_src
3. Create a =Documentation= sub-directory for each survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16/Documentation
#+end_src
In this directory include the following files:
- SOURCE :: A text file giving both a url (if available) and
citation information for the dataset.
- LICENSE :: A text file containing a description of the license
or other terms under which you've obtained the data.
4. Add other documentation useful for understanding the data to the
=Documentation= sub-directory.

5. Add all the contents of the =Documentation= folder to the =git= repo;
e.g.,
#+begin_src sh
cd ./Uganda/2015-16/Documentation
git add .
git commit -m"Add Uganda 2015-16 documentation to repo."
git push
#+end_src

6. Create a =Data= sub-directory for each survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16/Data
#+end_src

7. Obtain a copy of the data you're interested in, perhaps as a zip
file or other archive. Store this in some temporary place, and
extract the files into the relevant Country/Year/Data
directory, taking care to preserve any useful directory structure
in the archive. E.g.,
#+begin_src sh
# -j flattens the archive into the current directory; omit it to keep
# the archive's directory structure.
cd Uganda/2015-16/Data && unzip -j /tmp/UGA_2015_UNPS_v01_M_STATA8.zip
#+end_src
8. Add the data you've unarchived to =dvc=, then add the /pointers/
(i.e., files with a =.dvc= extension) to git. For the Uganda case we assume that
all the relevant data comes in the form of Stata =*.dta= files,
since this is what we downloaded from the World Bank. For example,
#+begin_src sh
cd ../Data
dvc add *.dta
# Stage the pointer files (and the .gitignore that dvc writes) before committing.
git add *.dta.dvc .gitignore
git commit -m"Add Uganda/2015-16/Data/*.dta files to dvc store."
git pull && git push
#+end_src
9. Push the data files to the dvc store. Make sure you have a good
internet connection! Then a simple
#+begin_src sh
dvc push
#+end_src
will copy the data to the remote data store. NB: If this is the
first time you've done this for this repository, then you'll
first need to jump through some simple hoops to authenticate with
gdrive.
10. With the files pushed to the dvc store, you won't need them
locally anymore, so you can do something like
#+begin_src sh
cd ../Data && rm *.dta
#+end_src
or (if you have a more complex directory structure) perhaps
#+begin_src sh
find ../Data -name \*.dta -exec rm \{\} \;
#+end_src
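
The directory layout from steps 1 to 6 and the cleanup in step 10 can
also be scripted. The following is a sketch under stated assumptions,
not part of the library: =scaffold_survey= and =remove_pushed_data=
are hypothetical helpers, the placeholder text written to =SOURCE=
and =LICENSE= is invented, and the cleanup assumes each pointer
=foo.dta.dvc= sits beside its data file. It only deletes data files
that have a pointer, and it cannot check that =dvc push= has actually
succeeded.
#+begin_src python
from pathlib import Path


def scaffold_survey(root, country, period):
    """Create <country>/<period>/{Documentation,Data} with stub SOURCE and LICENSE."""
    survey = Path(root) / country / period
    docs = survey / 'Documentation'
    docs.mkdir(parents=True, exist_ok=True)
    (survey / 'Data').mkdir(exist_ok=True)
    for name in ('SOURCE', 'LICENSE'):
        stub = docs / name
        if not stub.exists():  # never overwrite text someone has written
            stub.write_text(f'TODO: fill in {name} for {country} {period}\n')
    return survey


def remove_pushed_data(data_dir, pattern='*.dta'):
    """Delete files matching `pattern` under `data_dir` that have a .dvc pointer."""
    removed = []
    # sorted() materializes the listing before any file is deleted.
    for data_file in sorted(Path(data_dir).rglob(pattern)):
        pointer = data_file.with_name(data_file.name + '.dvc')
        if pointer.is_file():
            data_file.unlink()
            removed.append(data_file)
    return removed
#+end_src
For the Uganda example, =scaffold_survey('.', 'Uganda', '2015-16')=
would create the directories used above, and
=remove_pushed_data('Uganda/2015-16/Data')= would leave any =.dta=
file without a pointer untouched.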
