
Bug#858160: ITP: wikiextractor -- tool to extract plain text from a Wikipedia dump




Package: wnpp
Severity: wishlist
Owner: Ben Finney <bignose@xxxxxxxxxx>

* Package name    : wikiextractor
  Version         : 2.75
  Upstream Author : Giuseppe Attardi <attardi@xxxxxxxxxxx>
* URL             : http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
* License         : GPL-3
  Programming Lang: Python
  Description     : tool to extract plain text from a Wikipedia dump

    The Wikipedia maintainers provide, each month, an XML dump of all
    documents in the database: a single XML file containing the whole
    encyclopedia, which can be used for various kinds of analysis, such
    as statistics, service lists, etc.

    This Wikipedia extractor tool generates plain text from a Wikipedia
    database dump. It discards any other information or annotation
    present in Wikipedia pages, such as images, tables, references and
    lists.


Some works use Wikipedia data as part of their complete source. This
package will be useful for build chains that require processing that
data as source.
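
To illustrate the kind of processing the description refers to, here is
a minimal Python sketch that streams pages out of a dump-like XML file
and strips some common wiki markup. It is not wikiextractor's actual
implementation: the function names (local_name, strip_markup,
iter_pages), the regular expressions, and the sample document are all
assumptions made for illustration, and real dumps contain nested
templates, tables and other constructs this does not handle.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical, tiny stand-in for a dump; real dumps are many gigabytes.
# Markup inside <text> is XML-escaped, as it would be in a real export.
SAMPLE = """<mediawiki>
  <page>
    <title>Pisa</title>
    <revision>
      <text>'''Pisa''' is a city in [[Tuscany|central Italy]].{{citation needed}}&lt;ref&gt;Some source&lt;/ref&gt;</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip any XML namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def strip_markup(wikitext):
    """Very rough wikitext cleanup (illustrative only)."""
    text = re.sub(r"<ref[^>]*?/>", "", wikitext)  # self-closing references
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # non-nested templates
    text = re.sub(r"\[\[(?:File|Image):[^\]]*\]\]", "", text)  # images
    # [[target|label]] becomes label, [[target]] becomes target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # bold and italic quote runs
    return text.strip()


def iter_pages(stream):
    """Yield (title, plain_text) for each <page> element in the stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, strip_markup(text)
            elem.clear()  # release the finished page's subtree


for page_title, plain in iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(page_title, "->", plain)
```

Parsing incrementally with iterparse, rather than loading the whole
tree, matters here because the dump is a single very large XML file.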

-- 
 \      “I put instant coffee in a microwave oven and almost went back |
  `\                                          in time.” —Steven Wright |
_o__)                                                                  |
Ben Finney <bignose@xxxxxxxxxx>
