Initial Nutch Setup (IDEA)

Posted by 言灵之书


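1. Environment setup: Ant. Download apache-ant-1.9.9-bin.zip from http://ant.apache.org/, extract it to a directory of your choice, and configure the environment variables: set ANT_HOME to F:\life\rainofsky\apache-ant-1.9.9 and add %ANT_HOME%\bin to path.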
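A quick way to confirm the variable took effect is to look for the launcher script under ANT_HOME. This is only a minimal sketch: the class name AntEnvCheck is made up for illustration, and it assumes the standard layout of the Ant binary distribution, where the launchers sit in the bin directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AntEnvCheck {
    /** Builds the launcher path under an Ant home: bin/ant.bat on Windows, bin/ant elsewhere. */
    public static Path antLauncher(String antHome, boolean windows) {
        return Paths.get(antHome, "bin", windows ? "ant.bat" : "ant");
    }

    public static void main(String[] args) {
        String antHome = System.getenv("ANT_HOME");
        if (antHome == null || antHome.isEmpty()) {
            System.out.println("ANT_HOME is not set");
            return;
        }
        // Assumption: a Windows JVM reports an os.name starting with "Windows".
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        Path launcher = antLauncher(antHome, windows);
        System.out.println(launcher + (Files.isRegularFile(launcher) ? " found" : " is missing"));
    }
}
```

Open a new command window before running it, since windows that were already open do not pick up changed environment variables.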

2. Download the Nutch source code from http://nutch.apache.org/downloads.html.

After extracting it, edit ivy/ivy.xml.

Enable the following two dependencies:

<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
  
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
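If enabling here means removing the XML comment markers around these entries, a small parser run can confirm the edit took: a DOM parser skips commented-out elements, so only active dependencies are found. A rough sketch, with the class name IvyDependencyCheck invented for illustration and the ivy/ivy.xml path taken relative to the Nutch root:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class IvyDependencyCheck {
    /** Returns true if an uncommented <dependency> with the given org and name exists. */
    public static boolean isEnabled(String ivyXml, String org, String name) {
        try {
            NodeList deps = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(ivyXml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("dependency");
            for (int i = 0; i < deps.getLength(); i++) {
                Element dep = (Element) deps.item(i);
                if (org.equals(dep.getAttribute("org")) && name.equals(dep.getAttribute("name"))) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse ivy.xml", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Run from the Nutch root directory so the relative path resolves.
        String xml = new String(Files.readAllBytes(Paths.get("ivy/ivy.xml")), StandardCharsets.UTF_8);
        System.out.println("gora-sql enabled: " + isEnabled(xml, "org.apache.gora", "gora-sql"));
        System.out.println("mysql-connector-java enabled: " + isEnabled(xml, "mysql", "mysql-connector-java"));
    }
}
```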

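3. Open a command window in the Nutch root directory and run: ant eclipse -verbose

This keeps downloading jar packages for a long time, close to half an hour. I suspect the jar download path should be configured as well: everything was downloaded to the C: drive, which made me worry about my machine, though it hardly matters if you have plenty of disk space.

The run took 29 minutes, which is acceptable.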

4. Import Nutch into IDEA:

5. Edit conf/nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Name of the crawler -->
    <property>
        <name>http.agent.name</name>
        <value>mySplider</value>
    </property>
    <!-- Languages the crawler accepts -->
    <property>
        <name>http.accept.language</name>
        <value>ja-jp, en-us,en-gb,en,zh-cn,zh-tw;q=0.7,*;q=0.3</value>
        <description>Value of the "Accept-Language" request header field.
            This allows selecting non-English language as default one to retrieve.
            It is a useful setting for search engines build for certain national group.</description>
    </property>
    <!-- Default encoding for crawled text -->
    <property>
        <name>parser.character.encoding.default</name>
        <value>utf-8</value>
        <description>The character encoding to fall back to when no other information
            is available</description>
    </property>
    <!-- Directory of the crawler plugins -->
    <property>
        <name>plugin.folders</name>
        <value>src/plugin</value>
        <description>Directories where nutch plugins are located. Each
            element may be a relative or absolute path. If absolute, it is used
            as is. If relative, it is searched for on the classpath.</description>
    </property>
    <!-- Use SQL as the crawler's storage backend -->
    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.sql.store.SqlStore</value>
        <description>The Gora DataStore class for storing and retrieving data.
            Currently the following stores are available: ….</description>
    </property>
    <!-- Batch id to generate -->
    <property>
        <name>generate.batch.id</name>
        <value>*</value>
    </property>
</configuration>
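To double-check what this file actually sets, the property entries can be read back with the JDK's built-in XML parser. This is a sketch rather than how Nutch itself loads its configuration; the class name NutchSiteReader is invented for illustration, and it assumes every property element carries both a name and a value child.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class NutchSiteReader {
    /** Collects the name/value pairs of all property elements, in file order. */
    public static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the path used in this post; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "conf/nutch-site.xml";
        String xml = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        readProperties(xml).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

A misspelled property name or a missing storage.data.store.class entry shows up immediately in the printed list.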

6. Configure conf/gora.properties:

gora.datastore.default=org.apache.gora.sql.store.SqlStore
gora.datastore.autocreateschema=true
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=password
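A missing or misspelled key here tends to surface only later, at crawl time, so it can be worth checking the file up front. The sketch below loads the text with java.util.Properties and reports which of the keys from the listing above are absent or empty. The class name GoraPropsCheck and the choice of which keys count as required are my own assumptions, not something Gora defines.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class GoraPropsCheck {
    // Keys copied from the listing in this post; the password is left out because it may be empty.
    static final String[] REQUIRED = {
        "gora.datastore.default",
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user"
    };

    /** Returns the required keys that are absent or blank in the given properties text. */
    public static List<String> missingKeys(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        String text = new String(Files.readAllBytes(Paths.get("conf/gora.properties")), StandardCharsets.UTF_8);
        List<String> missing = missingKeys(text);
        System.out.println(missing.isEmpty() ? "All required keys are set" : "Missing keys: " + missing);
    }
}
```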

7. Create the MySQL database and table structure:

CREATE TABLE webpage (
  id varchar(256) NOT NULL,
  headers blob,
  text longtext DEFAULT NULL,
  status int(11) DEFAULT NULL,
  markers blob,
  parseStatus blob,
  modifiedTime bigint(20) DEFAULT NULL,
  prevModifiedTime bigint(20) DEFAULT NULL,
  score float DEFAULT NULL,
  typ varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  batchId varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  baseUrl varchar(256) DEFAULT NULL,
  content longblob,
  title text DEFAULT NULL,
  reprUrl varchar(256) DEFAULT NULL,
  fetchInterval int(11) DEFAULT NULL,
  prevFetchTime bigint(20) DEFAULT NULL,
  inlinks mediumblob,
  prevSignature blob,
  outlinks mediumblob,
  fetchTime bigint(20) DEFAULT NULL,
  retriesSinceFetch int(11) DEFAULT NULL,
  protocolStatus blob,
  signature blob,
  metadata blob,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

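Running this SQL statement produced an error.

After a lot of searching, I found that this version allows at most 255 characters, so changing 256 to 255 fixed it.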
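The post does not show the error text, so the following is a guess at the cause rather than a diagnosis. Older MySQL/InnoDB setups cap a single-column index key at 767 bytes, and MySQL's utf8 charset needs up to 3 bytes per character, which would put the ceiling for an indexed varchar column such as the primary key id at exactly 255 characters. The arithmetic, with both numbers treated as assumptions and the class name KeyLengthCheck invented for illustration:

```java
public class KeyLengthCheck {
    /** Largest character count whose worst-case byte size still fits in the key limit. */
    public static int maxIndexedChars(int maxKeyBytes, int bytesPerChar) {
        return maxKeyBytes / bytesPerChar;
    }

    /** True if a varchar of the given length fits in the key limit in the worst case. */
    public static boolean fits(int columnChars, int maxKeyBytes, int bytesPerChar) {
        return (long) columnChars * bytesPerChar <= maxKeyBytes;
    }

    public static void main(String[] args) {
        int keyLimitBytes = 767; // assumed InnoDB index key limit on this MySQL version
        int bytesPerChar = 3;    // assumed worst case for MySQL's utf8 charset
        System.out.println("max indexed chars: " + maxIndexedChars(keyLimitBytes, bytesPerChar));
        System.out.println("varchar(256) fits: " + fits(256, keyLimitBytes, bytesPerChar));
        System.out.println("varchar(255) fits: " + fits(255, keyLimitBytes, bytesPerChar));
    }
}
```

If that is indeed the cause, only the indexed id column needs to shrink to 255; the other varchar(256) columns are not part of any key.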

With that, the environment is configured and ready to run. It still needs testing, though; I will post the test results in a follow-up.

 

 
