Processing Wikipedia Data with JWPL for NLP

Posted 2021-02-03 xmaples

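Processing zhwiki

JWPL is a toolkit for working with Wikipedia. Its main job is to process and optimize the files of a Wikipedia dump and import them into a MySQL database for use in NLP. The walkthrough below uses zhwiki-20170201 as the example.

JWPLDataMachine processes the wiki dump, and the result is eventually imported into MySQL for NLP work. The table schema differs from the official Wikipedia schema; these tables are designed for NLP purposes. The steps are as follows:

1. Convert the data format. Process the zhwiki dump files into TSV data so that they can be loaded into MySQL with mysqlimport. The command: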

java -Xmx4g -cp $M2_REPO/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar:$M2_REPO/de/tudarmstadt/ukp/wikipedia/de.tudarmstadt.ukp.wikipedia.datamachine/1.1.0/de.tudarmstadt.ukp.wikipedia.datamachine-1.1.0.jar:$M2_REPO/de/tudarmstadt/ukp/wikipedia/de.tudarmstadt.ukp.wikipedia.mwdumper/1.1.0/de.tudarmstadt.ukp.wikipedia.mwdumper-1.1.0.jar:$M2_REPO/de/tudarmstadt/ukp/wikipedia/de.tudarmstadt.ukp.wikipedia.wikimachine/1.1.0/de.tudarmstadt.ukp.wikipedia.wikimachine-1.1.0.jar:$M2_REPO/javax/activation/activation/1.1/activation-1.1.jar:$M2_REPO/javax/mail/mail/1.4.1/mail-1.4.1.jar:$M2_REPO/log4j/log4j/1.2.16/log4j-1.2.16.jar:$M2_REPO/net/sf/trove4j/trove4j/3.0.2/trove4j-3.0.2.jar:$M2_REPO/org/apache/ant/ant/1.8.3/ant-1.8.3.jar:$M2_REPO/org/apache/ant/ant-launcher/1.8.3/ant-launcher-1.8.3.jar:$M2_REPO/org/springframework/spring-asm/3.1.1.RELEASE/spring-asm-3.1.1.RELEASE.jar:$M2_REPO/org/springframework/spring-core/3.1.1.RELEASE/spring-core-3.1.1.RELEASE.jar:$M2_REPO/org/springframework/spring-beans/3.1.1.RELEASE/spring-beans-3.1.1.RELEASE.jar de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine chinese 頁面分類 全部消歧義頁面 zhwiki-20170201/

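Where:

- $M2_REPO is the Maven 2 local repository directory, ~/.m2/repository by default. It can be defined as an environment variable in the shell: export M2_REPO=~/.m2/repository. Make sure the dependency jars have already been downloaded.
- The zhwiki-20170201/ directory contains three files: zhwiki-20170201-categorylinks.sql.gz, zhwiki-20170201-pagelinks.sql.gz and zhwiki-20170201-pages-articles.xml.bz2. Change the path to wherever you keep these three files.
- The -Xmx4g option sets the maximum JVM heap to 4 GB. For this version of the dump, memory use actually stayed below 2 GB; the job is mostly CPU-bound. While running, the program writes a date-named .txt log file to the current directory. The whole run takes about one hour.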
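
The long -cp argument follows the standard Maven local-repository layout: the group ID with its dots turned into directories, then the artifact ID, the version, and a file named artifact-version.jar. As a rough sketch, the helper below rebuilds those jar paths from Maven coordinates and reports which ones are missing, which is one way to confirm that the dependencies were downloaded before starting an hour-long run. The names jar_path, missing_jars and COORDS are made up for this illustration, and COORDS lists only a few of the jars from the command above.

```python
import os

def jar_path(repo, group_id, artifact_id, version):
    """Return the jar location under a Maven local repository (standard layout)."""
    group_dir = group_id.replace(".", "/")
    return f"{repo}/{group_dir}/{artifact_id}/{version}/{artifact_id}-{version}.jar"

def missing_jars(repo, coordinates):
    """List the expected jar paths that do not exist on disk."""
    paths = [jar_path(repo, g, a, v) for g, a, v in coordinates]
    return [p for p in paths if not os.path.isfile(p)]

# A few (groupId, artifactId, version) triples taken from the command above.
COORDS = [
    ("commons-logging", "commons-logging", "1.1.1"),
    ("de.tudarmstadt.ukp.wikipedia", "de.tudarmstadt.ukp.wikipedia.datamachine", "1.1.0"),
    ("net.sf.trove4j", "trove4j", "3.0.2"),
]
```

Calling missing_jars(os.path.expanduser("~/.m2/repository"), COORDS) returns an empty list when all listed jars are present.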

2. Create the database. From a Linux shell:

$ mysql -u[USER] -p -e "CREATE DATABASE [DB_NAME] DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;"

Or, in the mysql shell:

mysql> CREATE DATABASE [DB_NAME] DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

Here [USER] and [DB_NAME] are the user name and the database name. For example, to create the database from the mysql shell:

mysql> CREATE DATABASE zhjwpl DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

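Note: be sure to set the character set to utf8.

3. Create the tables. Copy the contents of jwpl_tables.sql (listed in the appendix) into a local file, select the new database (USE [DB_NAME];), and run the file in the mysql shell: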

mysql> SOURCE jwpl_tables.sql;

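4. Convert Traditional to Simplified Chinese. (This step is optional but recommended.) Chinese Wikipedia contains a lot of mixed Simplified and Traditional text; even category names and article titles are sometimes Simplified and sometimes Traditional. We can convert Traditional to Simplified. The conversion is not a simple one-to-one mapping (one Simplified character can correspond to several Traditional ones and vice versa), but string length is preserved: one Traditional character converts to exactly one Simplified character. The opencc tool can do the conversion.

Because only output/*.txt is imported into the database (every file in output is a .txt file, 11 in total), only those files need converting. Of these, only six contain Chinese text and have to go through opencc: Page.txt, PageMapLine.txt, page_categories.txt, Category.txt, MetaData.txt and page_redirects.txt. The other five (category_inlinks.txt, category_outlinks.txt, category_pages.txt, page_inlinks.txt and page_outlinks.txt) contain only numeric IDs and need no conversion.

Here the converted files are written to a directory chs that sits next to output, keeping the same file names, and the files that need no conversion are copied into chs as well (or symlinked). The conversion command looks like: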

$ opencc -c t2s -i output/Page.txt -o chs/Page.txt
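
Running that command by hand for six files and copying five more is easy to get wrong, so the sketch below shows one possible way to script it. It uses the file lists given in the step above and the same opencc -c t2s invocation; the function names plan_conversion and run_conversion are hypothetical, and run_conversion assumes the opencc command is installed and on the PATH.

```python
import os
import shutil
import subprocess

# Files that contain Chinese text and need conversion (list taken from the step above).
TEXT_FILES = ["Page.txt", "PageMapLine.txt", "page_categories.txt",
              "Category.txt", "MetaData.txt", "page_redirects.txt"]
# Files that hold only numeric IDs and can be copied unchanged.
ID_FILES = ["category_inlinks.txt", "category_outlinks.txt", "category_pages.txt",
            "page_inlinks.txt", "page_outlinks.txt"]

def plan_conversion(src="output", dst="chs"):
    """Return (opencc_commands, copies) without touching the file system."""
    commands = [["opencc", "-c", "t2s", "-i", f"{src}/{name}", "-o", f"{dst}/{name}"]
                for name in TEXT_FILES]
    copies = [(f"{src}/{name}", f"{dst}/{name}") for name in ID_FILES]
    return commands, copies

def run_conversion(src="output", dst="chs"):
    """Execute the plan: convert the text files, copy the ID-only files."""
    commands, copies = plan_conversion(src, dst)
    os.makedirs(dst, exist_ok=True)
    for cmd in commands:
        subprocess.run(cmd, check=True)  # stop at the first failed conversion
    for source, target in copies:
        shutil.copyfile(source, target)
```

Separating the plan from its execution makes it possible to print and inspect the commands before anything is written to disk.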

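5. Import the data. When JWPLDataMachine processes the three dump files, it creates an output directory inside the directory that holds them and writes 11 .txt files there in TSV (tab-separated values) format. They could be loaded from the mysql shell with LOAD DATA INFILE ..., but mysqlimport is more convenient, so that is what is used here. Assuming the current working directory is the one containing the three dump files (zhwiki-20170201/), the command is: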

$ mysqlimport -u root -p --local --default-character-set=utf8 zhjwpl chs/*.txt

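After running for a while (roughly one to two hours), all the data is imported. The output reports the number of records (Records) inserted into each table; these numbers should match the line counts of the corresponding .txt files. The output also reports a few warnings, which can be ignored. If you skipped the Traditional-to-Simplified conversion, replace chs in the command with output.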
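
The check that record counts match line counts can be automated. The following is a minimal sketch, assuming you copy the per-table Records numbers from the mysqlimport output into a dictionary by hand; count_records and find_mismatches are illustrative names, not part of JWPL or MySQL.

```python
def count_records(path):
    """Count lines in a TSV file; each line is expected to become one row."""
    with open(path, "rb") as handle:  # binary mode avoids decoding the article text
        return sum(1 for _ in handle)

def find_mismatches(directory, reported):
    """Compare counts reported by mysqlimport with the line counts of the files.

    `reported` maps a table name (e.g. "Page") to the record count printed by
    mysqlimport. The result maps each mismatching table to
    (reported_count, lines_in_file).
    """
    mismatches = {}
    for table, records in reported.items():
        lines = count_records(f"{directory}/{table}.txt")
        if lines != records:
            mismatches[table] = (records, lines)
    return mismatches
```

An empty result means every listed table received as many rows as its file has lines.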

Usage

Add the dependency de.tudarmstadt.ukp.wikipedia:de.tudarmstadt.ukp.wikipedia.api:${version}. Using the API is fairly simple; see the official examples.

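Notes

When processing the dump files, the JWPLDataMachine class takes the following arguments: [LANGUAGE] [MAIN_CATEGORY_NAME] [DISAMBIGUATION_CATEGORY_NAME] [SOURCE_DIRECTORY].
LANGUAGE is the language of the Wikipedia being processed; for Chinese Wikipedia it is chinese.
MAIN_CATEGORY_NAME is the name of the top-level category in Wikipedia's category hierarchy. This name may change over time; as of March 2017 it is "頁面分類".
DISAMBIGUATION_CATEGORY_NAME is a category that every disambiguation page belongs to. It is not, as the official site describes, the category that contains the disambiguation subcategories (Category:"消歧义"), nor the Chinese counterpart of the English example given there. With "消歧义" (or its Traditional form when the data still contains Traditional text), only 13 disambiguation pages are found. The argument should be "全部消歧義頁面".
SOURCE_DIRECTORY is the directory that holds the three dump files (zhwiki-20170201/ in the command above).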

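How to find the top-level category name: search for any article, for example "分类", and click one of its category tags to reach a category page, whose URL looks like /wiki/Category:分类系统. Below the heading "分类：分类系统" there is a breadcrumb: "页面分类 > 人文学科 > 文学 > 书籍 > 图书资讯科学 > 分类系统". Click the first item, "页面分类", to open the page titled "分类：页面分类", whose body says "这里是维基百科页面分类系统的最高级" (this is the top level of Wikipedia's page category system). The top-level category is the one this page describes. Its URL looks like /wiki/Category:頁面分類, so the category name is "頁面分類", that is, the name as it appears in the URL, not "分类：页面分类" or "页面分类". Category names and page titles are sometimes stored in Traditional Chinese in the database, even when the site is viewed in Simplified Chinese.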

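After processing with the top-level category "頁面分類" and the disambiguation category "消歧义", only 13 of the resulting 1266195 pages are disambiguation pages; the rest are article pages. It is unclear why there are so few, and the number seems unreasonable. The IDs of the 13 disambiguation pages are 1832186 5376724 5420049 5431949 5455483 5463308 5470979 5511092 5544906 5553846 5553849 5566592 5566629. The page with id 1832186 has the title "义胆雄心", and the online zhwiki confirms that it is a disambiguation page. So the Simplified category "消歧义" does mark disambiguation pages, but getting only 13 results is clearly worrying.

I have not read the DataMachine source code. Suspecting that the mix of Simplified and Traditional text was causing disambiguation pages to be lost, and not wanting to rerun the whole, rather involved process with the Traditional "消歧義" as the disambiguation category argument, I instead converted the zhwiki dump itself to Simplified Chinese first and then processed it with JWPL, so that JWPL only ever sees Simplified text.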

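The steps above need small changes. Assume the current directory is zhwiki-20170201/, which holds the three downloaded files zhwiki-20170201-pagelinks.sql.gz, zhwiki-20170201-categorylinks.sql.gz and zhwiki-20170201-pages-articles.xml.bz2. Then:

1. Convert the zhwiki dump to Simplified Chinese. Decompress the three files to plain text, convert them with opencc, and compress the results back into the corresponding formats. The commands: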

# Decompress the three archives into plain-text files (-k keeps the originals)
gunzip -k zhwiki-20170201-categorylinks.sql.gz
gunzip -k zhwiki-20170201-pagelinks.sql.gz
bunzip2 -k zhwiki-20170201-pages-articles.xml.bz2

# Go up one level and create a directory next to zhwiki-20170201/ for the Simplified Chinese data
cd .. && mkdir chswiki-20170201

# Convert the three text files from Traditional to Simplified Chinese
opencc -c t2s -i zhwiki-20170201/zhwiki-20170201-categorylinks.sql -o chswiki-20170201/chswiki-20170201-categorylinks.sql
opencc -c t2s -i zhwiki-20170201/zhwiki-20170201-pagelinks.sql -o chswiki-20170201/chswiki-20170201-pagelinks.sql
opencc -c t2s -i zhwiki-20170201/zhwiki-20170201-pages-articles.xml -o chswiki-20170201/chswiki-20170201-pages-articles.xml

# Compress the three Simplified files back into their formats: two .gz and one .bz2
cd chswiki-20170201
gzip chswiki-20170201-categorylinks.sql
gzip chswiki-20170201-pagelinks.sql
bzip2 chswiki-20170201-pages-articles.xml

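2. Process the data with JWPLDataMachine as before, except that the top-level category name and the disambiguation category name are now Simplified ("页面分类" and "消歧义") and the directory argument is the Simplified data directory chswiki-20170201/. The command line:

cd .. # go up one level, to the directory that contains both chswiki-20170201/ and zhwiki-20170201/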
java -Xmx4g -cp $M2_REPO/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar:$M2_REPO/de/tudarmstadt/ukp/wikipedia/de.tudarmstadt.ukp.wikipedia.datamachine/1.1.0/de.tudarmstadt.ukp.wikipedia.datamachine-1.1.0.jar:$M2_REPO/de/tudarmstadt/ukp/wikipedia/de.tudarmstadt.ukp.wikipedia.mwdumper/1.1.0/de.tudarmstadt.ukp.wikipedia.mwdumper-1.1.0.jar:$M2_REPO/de/tudarmstadt/ukp/wikipedia/de.tudarmstadt.ukp.wikipedia.wikimachine/1.1.0/de.tudarmstadt.ukp.wikipedia.wikimachine-1.1.0.jar:$M2_REPO/javax/activation/activation/1.1/activation-1.1.jar:$M2_REPO/javax/mail/mail/1.4.1/mail-1.4.1.jar:$M2_REPO/log4j/log4j/1.2.16/log4j-1.2.16.jar:$M2_REPO/net/sf/trove4j/trove4j/3.0.2/trove4j-3.0.2.jar:$M2_REPO/org/apache/ant/ant/1.8.3/ant-1.8.3.jar:$M2_REPO/org/apache/ant/ant-launcher/1.8.3/ant-launcher-1.8.3.jar:$M2_REPO/org/springframework/spring-asm/3.1.1.RELEASE/spring-asm-3.1.1.RELEASE.jar:$M2_REPO/org/springframework/spring-core/3.1.1.RELEASE/spring-core-3.1.1.RELEASE.jar:$M2_REPO/org/springframework/spring-beans/3.1.1.RELEASE/spring-beans-3.1.1.RELEASE.jar de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine chinese 页面分类 消歧义 chswiki-20170201/

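3. The remaining steps are the same as above, except that the Traditional-to-Simplified conversion in step 4 is of course skipped.
4. Result. The outcome is the same as before: the same 13 disambiguation pages. The cause remains to be investigated.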
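
One way to start that investigation, without rerunning the hour-long job, is to list every category whose name contains "消歧" in the generated Category.txt and see which candidates exist. The sketch below assumes that Category.txt is tab-separated with columns in the same order as the Category table (id, pageId, name); that column order is an assumption, and find_categories is a hypothetical helper.

```python
def find_categories(category_tsv, needle):
    """Return (pageId, name) pairs for categories whose name contains `needle`.

    Assumes tab-separated columns in the order id, pageId, name.
    """
    hits = []
    with open(category_tsv, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip malformed or empty lines
            page_id, name = fields[1], fields[2]
            if needle in name:
                hits.append((page_id, name))
    return hits
```

For example, find_categories("output/Category.txt", "消歧") lists the disambiguation-related categories in either script, since the characters 消 and 歧 are the same in Simplified and Traditional Chinese.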

Processing enwiki with JWPL

Command line:

...JWPLDataMachine english Contents All_disambiguation_pages wikidata

The run started at 2017-04-02 00:52:27 (Sunday) and finished at 2017-04-02 14:33:13 (Sunday), about 13 hours 41 minutes in total.
Record counts in the output:

nrOfCategories: 1572174
nrOfPage: 13094655
nrOfRedirects before testing the validity of the destination:7717778

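Peak memory stayed below 2 GB, and CPU usage (4 cores) was roughly 15% to 45%.

enwiki-20170320
English Wikipedia. After processing with JWPL and importing into MySQL, the data directory of the corresponding database is about 70 GB.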

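Meaning of the MySQL tables

JWPL only cares about pages with namespace=0 (ordinary articles). It generates several tables. The tables for a Wikipedia entry (Page, Category and so on) have both an id and a pageId field; each uniquely identifies the page and the two agree, so the former can be treated as the page_id. The tables and their fields are: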

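Category --> category information (Wikipedia categories)
- pageId: category ID, unique
- name: category name

Wikipedia categories form a hierarchy; picture it as a category tree. For a category, inlinks and outlinks refer to its parent and child categories: A->B means A is an inlink of B, so A is a parent of B. Outlinks are the reverse: B is an outlink (child) of A.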

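category_inlinks --> parent categories of a category (unconfirmed)
- id: category ID, not unique
- inLinks: ID of a parent category of this category
category_outlinks --> child categories of a category
- id: category ID, not unique
- outLinks: ID of a child category, that is, a category this category points to
category_pages --> category-to-page association
- id: category ID, not unique
- pages: a page that belongs to this category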
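MetaData --> meta information about this Wikipedia version; contains a single record
- language: language
- disambiguationCategory: the disambiguation category given when running DataMachine
- mainCategory: the top-level category given when running DataMachine
- nrofPages: number of pages
- nrofRedirects: number of redirects
- nrofDisambiguationPages: number of disambiguation pages
- nrofCategories: number of categories
- version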
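Page --> page information (Wikipedia articles). The table excludes redirect pages but includes disambiguation pages, which are flagged by the isDisambiguation field.
- pageId: page ID
- name: title
- text: full text, including MediaWiki markup
- isDisambiguation: whether the page is a disambiguation page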
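page_categories --> page-to-category relation; duplicates the information in category_pages
page_inlinks --> pages that link to a page
- id: actually the page ID; not a primary key and not unique
- inLinks: ID of a page that links to this page
- (id, inLinks) pairs are not unique either
page_outlinks --> outgoing links of a page
- id: page ID, not unique
- outLinks: ID of a page that this page links to
- (id, outLinks) pairs are not unique either, and a page may link to itself; for example, chsjwpl contains id=13, outLinks=13.
page_redirects --> redirect records
- id: ID of the redirect target page, not unique
- redirects: (string) the title that is redirected
- For example, the record "139 孙文" means that the page titled "孙文" is a redirect page pointing to the page (article) with id 139.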
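PageMapLine --> title information for all pages; useful when handling redirects and similar pages
- id: ID of this page
- name: page title
- pageID: page ID; if the page is a redirect, this is the ID of the target page that holds the actual content
- stem: ???
- lemma: ???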
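
To make the field semantics above concrete, here is a small self-contained sketch that uses Python's built-in sqlite3 module rather than MySQL, with a handful of made-up rows. The table and column names follow the descriptions above; the category IDs and the title "Article139" are invented for illustration, and only the ("孙文", 139) redirect comes from the text. The helper names are hypothetical.

```python
import sqlite3

def build_toy_db():
    """Create an in-memory database holding a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE category_inlinks  (id INTEGER, inLinks  INTEGER);
        CREATE TABLE category_outlinks (id INTEGER, outLinks INTEGER);
        CREATE TABLE PageMapLine (id INTEGER PRIMARY KEY, name TEXT, pageID INTEGER);
        -- Toy hierarchy: category 1 is the parent of categories 2 and 3.
        INSERT INTO category_outlinks VALUES (1, 2), (1, 3);
        INSERT INTO category_inlinks  VALUES (2, 1), (3, 1);
        -- A redirect title and the article title both map to page 139.
        INSERT INTO PageMapLine (name, pageID) VALUES ('孙文', 139), ('Article139', 139);
    """)
    return conn

def parent_categories(conn, category_id):
    """inLinks of a category are the IDs of its parent categories."""
    rows = conn.execute(
        "SELECT inLinks FROM category_inlinks WHERE id = ? ORDER BY inLinks",
        (category_id,))
    return [row[0] for row in rows]

def child_categories(conn, category_id):
    """outLinks of a category are the IDs of its child categories."""
    rows = conn.execute(
        "SELECT outLinks FROM category_outlinks WHERE id = ? ORDER BY outLinks",
        (category_id,))
    return [row[0] for row in rows]

def resolve_title(conn, title):
    """Map a title (article or redirect) to the ID of the page holding the content."""
    row = conn.execute(
        "SELECT pageID FROM PageMapLine WHERE name = ?", (title,)).fetchone()
    return row[0] if row else None
```

The same SELECT statements should carry over to the MySQL tables, since they use only the columns described above.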

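Downloading the Maven dependencies

First install Maven correctly. Test it with mvn --version; if it prints the Maven version information, the installation is fine.

Downloading the long list of dependency jars above one by one would be tedious. The mvn command can fetch them automatically:

1. Go to a temporary directory, create a temporary project directory, create pom.xml there, and fill in the dependencies.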

cd /tmp
mkdir tmp-proj && cd tmp-proj

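Then create the pom.xml file with the following content and save it:
```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>whatever</groupId>
    <artifactId>whatever</artifactId>
    <version>whatever</version>
    <name>whatever</name>
    <dependencies>
        <!-- .api: used when accessing the data -->
        <dependency>
            <groupId>de.tudarmstadt.ukp.wikipedia</groupId>
            <artifactId>de.tudarmstadt.ukp.wikipedia.api</artifactId>
            <version>1.1.0</version>
        </dependency>
        <!-- required by .api -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
        <!-- .datamachine: used when processing the data -->
        <dependency>
            <groupId>de.tudarmstadt.ukp.wikipedia</groupId>
            <artifactId>de.tudarmstadt.ukp.wikipedia.datamachine</artifactId>
            <version>1.1.0</version>
        </dependency>
    </dependencies>
</project>
```

2. Download the jars automatically. On the command line:
```bash
# Make sure you are in the directory that contains pom.xml, then run:
mvn dependency:resolve
```
mvn reads pom.xml and automatically downloads all the dependency jars into the local Maven repository.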

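To find the local paths of all the jars a Maven project depends on, run mvn dependency:build-classpath -Dmdep.outputFile=/tmp/proj-classpath.txt. The jar paths are joined with the platform's path separator (a semicolon on Windows, a colon on Linux). The option -Dmdep.outputFile=/tmp/proj-classpath.txt tells mvn to write the jar paths to /tmp/proj-classpath.txt. It is optional; without it, the classpath is printed to standard output along with mvn's other messages.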
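
With the classpath written to a file, the very long java command used earlier no longer has to be typed by hand. The sketch below shows one possible way to assemble it; the main class name and the four positional arguments come from this article, while the helper names read_classpath and datamachine_command are hypothetical.

```python
MAIN_CLASS = "de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine"

def read_classpath(path="/tmp/proj-classpath.txt"):
    """Read the classpath string written by mvn dependency:build-classpath."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()

def datamachine_command(classpath, language, main_category,
                        disambiguation_category, source_dir, max_heap="4g"):
    """Build the argument list for running JWPLDataMachine."""
    return ["java", f"-Xmx{max_heap}", "-cp", classpath.strip(), MAIN_CLASS,
            language, main_category, disambiguation_category, source_dir]
```

The resulting list can be passed to subprocess.run(..., check=True), for example datamachine_command(read_classpath(), "chinese", "頁面分類", "全部消歧義頁面", "zhwiki-20170201/"). Passing arguments as a list avoids shell quoting problems with the Chinese category names.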

Appendix:

Contents of jwpl_tables.sql:

/*******************************************************************************
 * Copyright 2016
 * Ubiquitous Knowledge Processing (UKP) Lab
 * Technische Universität Darmstadt
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *******************************************************************************/
-- MySQL dump 10.11
--
-- Host: localhost    Database: jwpl_tables
-- ------------------------------------------------------
-- Server version   5.0.37-community-nt

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

--
-- Table structure for table `Category`
--

DROP TABLE IF EXISTS `Category`;
CREATE TABLE `Category` (
  `id` bigint(20) NOT NULL auto_increment,
  `pageId` int(11) default NULL,
  `name` varchar(255) default NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `pageId` (`pageId`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

--
-- Dumping data for table `Category`
--

LOCK TABLES `Category` WRITE;
/*!40000 ALTER TABLE `Category` DISABLE KEYS */;
/*!40000 ALTER TABLE `Category` ENABLE KEYS */;
UNLOCK TABLES;

--
-- Table structure for table `category_inlinks`
--

DROP TABLE IF EXISTS `category_inlinks`;
CREATE TABLE `category_inlinks` (
  `id` bigint(20) NOT NULL,
  `inLinks` int(11) default NULL,
  KEY `FK3F433773E46A97CC` (`id`),
  KEY `FK3F433773BB482769` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

--
-- Dumping data for table `category_inlinks`
--

LOCK TABLES `category_inlinks` WRITE;
/*!40000 ALTER TABLE `category_inlinks` DISABLE KEYS */;
/*!40000 ALTER TABLE `category_inlinks` ENABLE KEYS */;
UNLOCK TABLES;

--
-- Table structure for table `category_outlinks`
--

DROP TABLE IF EXISTS `category_outlinks`;
CREATE TABLE `category_outlinks` (
  `id` bigint(20) NOT NULL,
  `outLinks` int(11) default NULL,
  KEY `FK9885334CE46A97CC` (`id`),
  KEY `FK9885334CBB482769` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

--
-- Dumping data for table `category_outlinks`
--

LOCK TABLES `category_outlinks` WRITE;
/*!40000 ALTER TABLE `category_outlinks` DISABLE KEYS */;
/*!40000 ALTER TABLE `category_outlinks` ENABLE KEYS */;
UNLOCK TABLES;

--
-- Table structure for table `category_pages`
--

DROP TABLE IF EXISTS `category_pages`;
CREATE TABLE `category_pages` (
  `id` bigint(20) NOT NULL,
  `pages` int(11) default NULL,
  KEY `FK71E8D943E46A97CC` (`id`),
  KEY `FK71E8D943BB482769` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

--
-- Dumping data for table `category_pages`
--

LOCK TABLES `category_pages` WRITE;
/*!40000 ALTER TABLE `category_pages` DISABLE KEYS */;
/*!40000 ALTER TABLE `category_pages` ENABLE KEYS */;
UNLOCK TABLES;

--
-- Table structure for table `MetaData`
--

DROP TABLE IF EXISTS `MetaData`;
CREATE TABLE `MetaData` (
  `id` bigint(20) NOT NULL auto_increment,
  `language` varchar(255) default NULL,
  `disambiguationCategory` varchar(255) default NULL,
  `mainCategory` varchar(255) default NULL,
  `nrofPages` bigint(20) default NULL,
  `nrofRedirects` bigint(20) default NULL,
  `nrofDisambiguationPages` bigint(20) default NULL,
  `nrofCategories` bigint(20) default NULL,
  `version` varchar(255) default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

--
-- Dumping data for table `MetaData`
--

LOCK TABLES `MetaData` WRITE;
/*!40000 ALTER TABLE `MetaData` DISABLE KEYS */;
/*!40000 ALTER TABLE `MetaData` ENABLE KEYS */;
UNLOCK TABLES;

--
-- Table structure for table `Page`
--

DROP TABLE IF EXISTS `Page`;
CREATE TABLE `Page` (
  `id` bigint(20) NOT NULL auto_increment,
  `pageId` int(11) default NULL,
  `name` varchar(255) default NULL,
  `text` longtext,
  `isDisambiguation` bit(1) default NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `pageId` (`pageId`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

--
-- Dumping data for table `Page`
--

LOCK TABLES `Page` WRITE;
/*!40000 ALTER TABLE `Page` DISABLE KEYS */;
/*!40000 ALTER TABLE `Page` ENABLE KEYS */;
UNLOCK TABLES;

--
-- Table structure for table `page_categories`
--

DROP TABLE IF EXISTS `page_categories`;
CREATE TABLE `page_categories` (
  `id` bigint(20) NOT NULL,
  `pages` int(11) default NULL,
  KEY `FK72FB59CC1E350EDD` (`id`),
  KEY `FK72FB59CC75DCF4FA` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

--
-- Dumping data for table `page_categories`
--

LOCK TABLES `page_categories` WRITE;
/*!40000 ALTER TABLE `page_categories` DISABLE KEYS */;
/*!40000 ALTER TABLE `page_categories` ENABLE KEYS */;
UNLOCK TABLES;

--
-- Table structure for table `page_inlinks`
--

DROP TABLE IF EXISTS `page_inlinks`;
CREATE TABLE `page_inlinks` (
  `id` bigint(20) NOT NULL,
  `inLinks` int(11) default NULL,
  KEY `FK91C2BC041E350EDD` (`id`),
  KEY `FK91C2BC0475DCF4FA` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

--
-- Dumping data for table `page_inlinks`
--

LOCK TABLES `page_inlinks` WRITE;
/*!40000 ALTER TABLE `page_inlinks` DISABLE KEYS */;
/*!40000 ALTER TABLE `page_inlinks` ENABLE KEYS */;
UNLOCK TABLES;

--
-- Table structure for table `page_outlinks`
--

DROP TABLE IF EXISTS `page_outlinks`;
CREATE TABLE `page_outlinks` (
  `id` bigint(20) NOT NULL,
  `outLinks` int(11) default NULL,
  KEY `FK95F640DB1E350EDD` (`id`),
  KEY `FK95F640DB75DCF4FA` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

--
-- Dumping data for table `page_outlinks`
--

LOCK TABLES `page_outlinks` WRITE;
/*!40000 ALTER TABLE `page_outlinks` DISABLE KEYS */;
/*!40000 ALTER TABLE `page_outlinks` ENABLE KEYS */;
UNLOCK TABLES;

--
-- Table structure for table `page_redirects`
--

DROP TABLE IF EXISTS `page_redirects`;
CREATE TABLE `page_redirects` (
  `id` bigint(20) NOT NULL,
  `redirects` varchar(255) default NULL,
  KEY `FK1484BA671E350EDD` (`id`),
  KEY `FK1484BA6775DCF4FA` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

--
-- Dumping data for table `page_redirects`
--

LOCK TABLES `page_redirects` WRITE;
/*!40000 ALTER TABLE `page_redirects` DISABLE KEYS */;
/*!40000 ALTER TABLE `page_redirects` ENABLE KEYS */;
UNLOCK TABLES;

--
-- Table structure for table `PageMapLine`
--

DROP TABLE IF EXISTS `PageMapLine`;
CREATE TABLE `PageMapLine` (
  `id` bigint(20) NOT NULL auto_increment,
  `name` varchar(255) default NULL,
  `pageID` int(11) default NULL,
  `stem` varchar(255) default NULL,
  `lemma` varchar(255) default NULL,
  PRIMARY KEY  (`id`),
  KEY `name` (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

--
-- Dumping data for table `PageMapLine`
--

LOCK TABLES `PageMapLine` WRITE;
/*!40000 ALTER TABLE `PageMapLine` DISABLE KEYS */;
/*!40000 ALTER TABLE `PageMapLine` ENABLE KEYS */;
UNLOCK TABLES;
/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;

-- Dump completed on 2008-02-11 12:33:30
