Utility for parsing Wikipedia SQL dumps into CSVs.

Kensho Wikimedia for Natural Language Processing - SQL Dump Parser

kwnlp_sql_parser is a Python package for parsing Wikipedia SQL dumps into CSVs.

Quick Install (Requires Python >= 3.6)

pip install kwnlp-sql-parser

Examples

Basic Usage

To convert a Wikipedia MySQL/MariaDB dump into a CSV file, use the to_csv method of the WikipediaSqlDump class. By default, the CSV file is created in the current directory and includes all of the columns and rows in the SQL dump file.

import pandas as pd
from kwnlp_sql_parser import WikipediaSqlDump
# path to a gzipped Wikipedia SQL dump (the page table in this example)
file_path = "/path/to/data/enwiki-20200920-page.sql.gz"
wsd = WikipediaSqlDump(file_path)
wsd.to_csv()  # writes enwiki-20200920-page.csv to the current directory
df = pd.read_csv("enwiki-20200920-page.csv", keep_default_na=False, na_values=[""])
print(df.head())
   page_id  page_namespace            page_title page_restrictions  page_is_redirect  page_is_new  page_random    page_touched  page_links_updated  page_latest  page_len page_content_model  page_lang
0       10               0   AccessibleComputing               NaN                 1            0     0.331671  20200903074851        2.020090e+13    854851586        94           wikitext        NaN
1       12               0             Anarchism               NaN                 0            0     0.786172  20200920023613        2.020092e+13    979267494     88697           wikitext        NaN
2       13               0    AfghanistanHistory               NaN                 1            0     0.062150  20200909184138        2.020091e+13    783865149        90           wikitext        NaN
3       14               0  AfghanistanGeography               NaN                 1            0     0.952234  20200915100945        2.020091e+13    783865160        92           wikitext        NaN
4       15               0     AfghanistanPeople               NaN                 1            0     0.574721  20200917080644        2.020091e+13    783865293        95           wikitext        NaN

See the "Common Issues" section below for an explanation of the pandas read_csv kwargs.

Filtering Rows and Columns

In some situations, it is convenient to filter the Wikipedia SQL dumps before writing to CSV. For example, one might only be interested in the columns page_id and page_title for Wikipedia pages in the Main/Article namespace (page_namespace 0).

import pandas as pd
from kwnlp_sql_parser import WikipediaSqlDump
file_path = "/path/to/data/enwiki-20200920-page.sql.gz"
wsd = WikipediaSqlDump(
    file_path,
    keep_column_names=["page_id", "page_title"],  # columns to write to the CSV
    allowlists={"page_namespace": ["0"]})  # keep only rows where page_namespace is "0"
wsd.to_csv()
df = pd.read_csv("enwiki-20200920-page.csv", keep_default_na=False, na_values=[""])
print(df.head())
   page_id            page_title
0       10   AccessibleComputing
1       12             Anarchism
2       13    AfghanistanHistory
3       14  AfghanistanGeography
4       15     AfghanistanPeople

Note that you can also specify blocklists instead of allowlists if it is more convenient for your use case.
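For example, a minimal sketch (assuming the same page dump as above and the page_is_redirect values shown in the first example) that drops redirect rows instead of selecting a namespace:

from kwnlp_sql_parser import WikipediaSqlDump

file_path = "/path/to/data/enwiki-20200920-page.sql.gz"

# drop rows whose page_is_redirect value is "1"
# (note the string value; see "Common Issues" below)
wsd = WikipediaSqlDump(
    file_path,
    keep_column_names=["page_id", "page_title"],
    blocklists={"page_is_redirect": ["1"]})
wsd.to_csv()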

Common Issues

Not using string values in filters

All values in the allowlists and blocklists should be strings, even when the underlying column is numeric (e.g. page_namespace), as in the example below.
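For example, when filtering on the numeric page_namespace column, pass the value as a string:

# incorrect: integer values will not match
allowlists = {"page_namespace": [0]}

# correct: all filter values are strings
allowlists = {"page_namespace": ["0"]}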

Page titles treated as null

Be careful when reading the CSVs in your chosen software. Some packages treat page titles such as "NaN", "NA", "None", or "Null" as null values instead of strings; pandas does this by default.

In pandas, this can be handled by reading the CSV with:

df = pd.read_csv("enwiki-20200920-page.csv", keep_default_na=False, na_values=[""])
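To see why these kwargs matter, here is a small self-contained sketch (using an inline CSV rather than a real dump) of a page literally titled "NaN" surviving the round trip:

import io
import pandas as pd

# hypothetical CSV fragment containing a page literally titled "NaN"
csv_text = "page_id,page_title\n1,NaN\n2,Anarchism\n"

# default behavior: the title "NaN" is parsed as a missing value (float nan)
print(pd.read_csv(io.StringIO(csv_text))["page_title"].tolist())
# [nan, 'Anarchism']

# keep_default_na=False disables pandas' built-in NA strings,
# and na_values=[""] keeps truly empty fields as missing
print(pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])["page_title"].tolist())
# ['NaN', 'Anarchism']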

Supported Tables

License

Licensed under the Apache 2.0 License. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2020-present Kensho Technologies, LLC.
