Make zimfile from stackexchange dump
# Sotoki
*Stack Overflow to Kiwix*
The goal of this project is to provide a suite of tools that build the
[ZIM](http://www.openzim.org) files needed by the
[Kiwix](http://kiwix.org/) reader to make [Stack Overflow](https://stackoverflow.com/)
available offline (without Internet access).
## Getting started
Download the latest [Stack Exchange dump](https://archive.org/details/stackexchange)
using BitTorrent (only "superuser.com.7z" is necessary) and put it in the Sotoki
source code root.
Using btrfs as the file system is recommended (and required for stackoverflow).
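
Since btrfs is recommended, it can help to check which file system the working directory lives on before extracting a large dump. The following is only a sketch: the helper names `fs_type` and `check_btrfs` are hypothetical (not part of Sotoki), and it assumes GNU coreutils `stat`, whose `-f -c %T` options print the file system type.

```shell
# Hypothetical helpers, not part of Sotoki.
# Assumes GNU coreutils `stat`; BSD/macOS `stat` uses different options.

# Print the file system type of the directory in $1 (default: current directory).
fs_type() {
    stat -f -c %T "${1:-.}"
}

# Return non-zero and print a warning if the directory is not on btrfs.
check_btrfs() {
    fstype=$(fs_type "${1:-.}")
    if [ "$fstype" != "btrfs" ]; then
        echo "warning: ${1:-.} is on '$fstype', not btrfs" >&2
        return 1
    fi
}
```

For example, `check_btrfs .` could be run from the Sotoki root before copying the dump.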
Install the non-Python dependencies:
```
sudo apt-get install jpegoptim pngquant gifsicle advancecomp python-pip python-virtualenv python-dev libxml2-dev libxslt1-dev libbz2-dev p7zip-full python-pillow gif2apng
```
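
After installing, a quick way to confirm the command-line tools are reachable is to look them up on `PATH`. This is a minimal sketch: `missing_tools` is a hypothetical helper, and the binary names passed to it are assumed to match the package names above (with `7z` assumed to come from `p7zip-full`).

```shell
# Hypothetical helper, not part of Sotoki.
# Print the name of every given tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Binary names are assumptions based on the packages listed above.
missing_tools jpegoptim pngquant gifsicle 7z gif2apng
```

If the last command prints nothing, all of the listed tools were found.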
Create a virtual environment for Python:
```
virtualenv --system-site-packages venv
```
Activate the virtual environment:
```
source venv/bin/activate
```
Install this library:
```
pip install sotoki
```
Copy your Stack Exchange site dump (a .7z file, for example `superuser.com.7z`) to `work/dump/` and extract it there:
```
mkdir -p work/dump/
cp superuser.com.7z work/dump/
cd work/dump
7z e superuser.com.7z
rename 'y/A-Z/a-z/' *
```
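
The `rename 'y/A-Z/a-z/' *` step above uses the Perl-style `rename` command, which is not available on every system. As a sketch of an alternative, the hypothetical `lowercase_names` function below lowercases the file names in a directory using only `tr` and `mv`. It assumes a case-sensitive file system and skips a file when the lowercase name is already taken.

```shell
# Hypothetical helper, not part of Sotoki.
# Rename every entry in directory $1 (default: current directory) to lowercase.
lowercase_names() {
    dir=${1:-.}
    for path in "$dir"/*; do
        # Skip the unexpanded pattern when the directory is empty.
        [ -e "$path" ] || continue
        name=$(basename "$path")
        lower=$(printf '%s' "$name" | tr 'A-Z' 'a-z')
        if [ "$name" != "$lower" ]; then
            if [ -e "$dir/$lower" ]; then
                echo "skipping $name: $lower already exists" >&2
            else
                mv -- "$path" "$dir/$lower"
            fi
        fi
    done
}
```

Running `lowercase_names work/dump` from the Sotoki root would then stand in for the `rename` line.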
Go back to the Sotoki root and run the pipeline:
```
sotokirun [url of stackexchange website] [publisher] [--directory (optional)] [--nozim (optional)]
```
To run Sotoki again after a previous run, you must first remove the `work/output` directory.
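
The cleanup before a re-run can be sketched as a small function. `reset_output` is a hypothetical name, not something Sotoki provides; it removes only `work/output` under the given root and leaves `work/dump` untouched.

```shell
# Hypothetical helper, not part of Sotoki.
# Remove work/output under the root given in $1 (default: current directory).
reset_output() {
    root=${1:-.}
    out="$root/work/output"
    if [ -d "$out" ]; then
        rm -rf -- "$out"
        echo "removed $out"
    fi
}
```

Calling `reset_output` from the Sotoki root before invoking `sotokirun` again clears the previous output.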