Step 1: Have Eclipse, the Eclipse SVN plugin, and MySQL ready. MySQL should use UTF-8 encoding.

Step 2: Create the database and table in MySQL:

    CREATE DATABASE nutch;

    CREATE TABLE `webpage` (
      `id` varchar(767) NOT NULL,
      `headers` blob,
      `text` mediumtext DEFAULT NULL,
      `status` int(11) DEFAULT NULL,
      `markers` blob,
      `parseStatus` blob,
      `modifiedTime` bigint(20) DEFAULT NULL,
      `score` float DEFAULT NULL,
      `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
      `baseUrl` varchar(767) DEFAULT NULL,
      `content` longblob,
      `title` varchar(2048) DEFAULT NULL,
      `reprUrl` varchar(767) DEFAULT NULL,
      `fetchInterval` int(11) DEFAULT NULL,
      `prevFetchTime` bigint(20) DEFAULT NULL,
      `inlinks` mediumblob,
      `prevSignature` blob,
      `outlinks` mediumblob,
      `fetchTime` bigint(20) DEFAULT NULL,
      `retriesSinceFetch` int(11) DEFAULT NULL,
      `protocolStatus` blob,
      `signature` blob,
      `metadata` blob,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED DEFAULT CHARSET=utf8mb4;
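The `id` column is both the primary key and a utf8mb4 string, so its worst-case size in bytes matters. The sketch below only does the arithmetic, under the assumption that the server enforces the classic 767-byte InnoDB index key limit (it depends on row format and server settings, so your server may allow more or less). The class and method names are made up for illustration.

```java
// Rough estimate of how many characters fit in an indexed VARCHAR column.
// Assumption: a 767-byte index key limit applies; this is not verified
// against the author's MySQL setup.
class IndexKeyBudget {
    static final int KEY_LIMIT_BYTES = 767;

    /** Largest character count whose worst-case encoding still fits the limit. */
    static int maxIndexedChars(int maxBytesPerChar) {
        return KEY_LIMIT_BYTES / maxBytesPerChar;
    }

    public static void main(String[] args) {
        System.out.println("latin1  (1 byte/char):  " + maxIndexedChars(1)); // 767
        System.out.println("utf8    (3 bytes/char): " + maxIndexedChars(3)); // 255
        System.out.println("utf8mb4 (4 bytes/char): " + maxIndexedChars(4)); // 191
    }
}
```

Under that assumption a utf8mb4 key of 767 characters is far over budget, which is consistent with the CREATE TABLE being rejected. It does not by itself explain a lower ceiling such as 100, which may come from other server settings.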
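`id` varchar(767) NOT NULL did not work on my machine; the largest length that worked was 100, so I changed it to:

    `id` varchar(100) NOT NULL

Step 3: Check out the code from https://svn.apache.org/repos/asf/nutch/tags/release-2.1 and create a Java project from it locally. I have tried this many times, so here I named the project test.

Step 4: Add the source folders

In Project Explorer, right-click the project and choose Properties. Go to Java Build Path and, on the Source tab, remove the src folder. Then click "Add Folder" and add conf, src/bin, src/java, src/test, src/testresources, plus the src and test folders of every plugin under src/plugin. When you are done, the Source tab should list all of these folders under the project (test is the project name).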
Every Eclipse project folder contains a .classpath file. Open it and you should see content roughly like this:

    <classpathentry kind="src" path="conf"/>
    <classpathentry kind="src" path="src/java"/>
    <classpathentry kind="src" path="src/test"/>
    <classpathentry kind="src" path="src/plugin/protocol-file/src/test"/>
    <classpathentry kind="src" path="src/plugin/protocol-httpclient/src/test"/>
    <classpathentry kind="src" path="src/plugin/subcollection/src/test"/>
    <classpathentry kind="src" path="src/plugin/parse-html/src/test"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/test"/>
    <classpathentry kind="src" path="src/plugin/parse-html/src/java"/>
    <classpathentry kind="src" path="src/plugin/parse-tika/src/test"/>
    <classpathentry kind="src" path="src/plugin/lib-http/src/test"/>
    <classpathentry kind="src" path="src/plugin/parse-tika/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-regex/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-domain/src/java"/>
    <classpathentry kind="src" path="src/plugin/scoring-link/src/java"/>
    <classpathentry kind="src" path="src/plugin/index-anchor/src/test"/>
    <classpathentry kind="src" path="src/plugin/protocol-http/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/test"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-prefix/src/java"/>
    <classpathentry kind="src" path="src/plugin/scoring-opic/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-domain/src/test"/>
    <classpathentry kind="src" path="src/plugin/protocol-file/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/java"/>
    <classpathentry kind="src" path="src/plugin/language-identifier/src/java"/>
    <classpathentry kind="src" path="src/plugin/lib-regex-filter/src/test"/>
    <classpathentry kind="src" path="src/plugin/language-identifier/src/test"/>
    <classpathentry kind="src" path="src/plugin/subcollection/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/test"/>
    <classpathentry kind="src" path="src/plugin/index-basic/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/test"/>
    <classpathentry kind="src" path="src/plugin/creativecommons/src/java"/>
    <classpathentry kind="src" path="src/bin"/>
    <classpathentry kind="src" path="src/plugin/protocol-httpclient/src/java"/>
    <classpathentry kind="src" path="src/plugin/tld/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/java"/>
    <classpathentry kind="src" path="src/plugin/index-basic/src/test"/>
    <classpathentry kind="src" path="src/plugin/lib-http/src/java"/>
    <classpathentry kind="src" path="src/plugin/protocol-ftp/src/java"/>
    <classpathentry kind="src" path="src/plugin/index-anchor/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-validator/src/java"/>
    <classpathentry kind="src" path="src/plugin/index-more/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/test"/>
    <classpathentry kind="src" path="src/plugin/creativecommons/src/test"/>
    <classpathentry kind="src" path="src/plugin/microformats-reltag/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-regex/src/test"/>
    <classpathentry kind="src" path="src/plugin/lib-regex-filter/src/java"/>
    <classpathentry kind="src" path="src/plugin/index-more/src/test"/>
    <classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/java"/>
    <classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/java"/>
    <classpathentry kind="src" path="src/testresources"/>
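Adding every plugin's src and test folder by hand is tedious, so here is a minimal sketch that prints the plugin entries for you to paste into .classpath. It assumes the layout used above (src/plugin/<name>/src/java and src/plugin/<name>/src/test); the class name PluginSourceFolders is made up for this example and is not part of Nutch or Eclipse.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists the plugin source folders as .classpath entries.
class PluginSourceFolders {

    /** One classpathentry line per existing src/plugin/<name>/src/{java,test} folder. */
    static List<String> entries(Path projectRoot) throws IOException {
        List<String> lines = new ArrayList<>();
        Path pluginRoot = projectRoot.resolve("src").resolve("plugin");
        if (!Files.isDirectory(pluginRoot)) {
            return lines; // nothing to list
        }
        try (DirectoryStream<Path> plugins = Files.newDirectoryStream(pluginRoot)) {
            for (Path plugin : plugins) {
                String name = plugin.getFileName().toString();
                for (String kind : new String[] {"java", "test"}) {
                    // Only emit folders that really exist; some plugins have no tests.
                    if (Files.isDirectory(plugin.resolve("src").resolve(kind))) {
                        String rel = "src/plugin/" + name + "/src/" + kind;
                        lines.add("<classpathentry kind=\"src\" path=\"" + rel + "\"/>");
                    }
                }
            }
        }
        Collections.sort(lines); // stable, readable order
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String line : entries(root)) {
            System.out.println(line);
        }
    }
}
```

Run it with the project root as its only argument. The non-plugin entries (conf, src/java, src/test, src/bin, src/testresources) still need to be added separately.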
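Step 5: Add the libraries

Switch to the Libraries tab and choose "Add Library" -> "IvyDE Managed Dependencies" -> "Next". Choose "Project", select the ivy/ivy.xml file, and click OK. Eclipse will download the dependency jars automatically.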
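This step may fail. The error message shows that the org.restlet.jse packages cannot be downloaded. To fix it, find the following part of ivy/ivy.xml and comment it out:

    <dependency org="org.restlet.jse" name="org.restlet" rev="2.0.5" conf="*->default" />
    <dependency org="org.restlet.jse" name="org.restlet.ext.jackson" rev="2.0.5" conf="*->default" />

Then find these two jars online by hand, put them in the lib folder, and add them to Libraries.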
Next, add the ivy.xml file of each plugin under the plugin folder in the same way. They have to be added by hand, one at a time.
Step 6: On the "Order and Export" tab, move conf to the top.

Step 7: Database and other configuration

Open /conf/gora.properties, delete everything in the file, and write in the MySQL settings:

    ###############################
    # MySQL properties            #
    ###############################
    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
    gora.sqlstore.jdbc.user=root
    gora.sqlstore.jdbc.password=123456
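Since the file is rewritten from scratch, a dropped or misspelled key is an easy mistake. The sketch below parses the text with the standard java.util.Properties class and reports which of the four keys above are missing or empty. The class name GoraPropertiesCheck is made up for this example; it only checks the text and does not try to connect to MySQL.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sanity check for the gora.properties content shown above.
class GoraPropertiesCheck {
    static final String[] REQUIRED = {
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user",
        "gora.sqlstore.jdbc.password"
    };

    /** Returns the required keys that are absent or blank in the given text. */
    static List<String> missingKeys(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText)); // lines starting with # are comments
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        String text =
              "# MySQL properties\n"
            + "gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver\n"
            + "gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true\n"
            + "gora.sqlstore.jdbc.user=root\n"
            + "gora.sqlstore.jdbc.password=123456\n";
        System.out.println("Missing keys: " + missingKeys(text));
    }
}
```

To check the real file, read it into a string (or pass a FileReader to Properties.load) instead of using the inline sample.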
In /conf/gora-sql-mapping.xml, edit the primary key entry:

    <primarykey column="id" length="240"/>

In /conf/nutch-site.xml, add:

    <property>
      <name>http.agent.name</name>
      <value>Your Nutch Spider</value>
    </property>

    <property>
      <name>http.accept.language</name>
      <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
      <description>Value of the "Accept-Language" request header field.
      This allows selecting non-English language as default one to retrieve.
      It is a useful setting for search engines built for certain national group.
      </description>
    </property>

    <property>
      <name>parser.character.encoding.default</name>
      <value>utf-8</value>
      <description>The character encoding to fall back to when no other information
      is available</description>
    </property>

    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.
      </description>
    </property>

    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.sql.store.SqlStore</value>
      <description>The Gora DataStore class for storing and retrieving data.
      Currently the following stores are available: ….
      </description>
    </property>

    <property>
      <name>plugin.folders</name>
      <value>./src/plugin</value>
      <description>Directories where nutch plugins are located. Each
      element may be a relative or absolute path. If absolute, it is used
      as is. If relative, it is searched for on the classpath.</description>
    </property>
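The plugin.includes value is a regular expression over plugin directory names, so it is worth checking which names it actually selects before running a crawl. The sketch below uses java.util.regex and assumes the whole directory name must match the expression (Matcher.matches() semantics); that is my reading of the description, not something verified against the Nutch source. The class name PluginIncludesCheck is made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical check: which plugin directory names does plugin.includes select?
// Assumption: the expression must match the whole directory name.
class PluginIncludesCheck {
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)"
        + "|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic");

    static boolean included(String pluginDirName) {
        return INCLUDES.matcher(pluginDirName).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "protocol-http", "protocol-httpclient", "parse-tika",
            "index-more", "language-identifier", "urlfilter-domain"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (included(name) ? "included" : "excluded"));
        }
    }
}
```

With this value, index-more, language-identifier, and urlfilter-domain are excluded even though their source folders are on the build path, which is fine as long as you do not expect them to run.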
In build.xml in the project root, find the following code:
    <target name="resolve-default" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">
      <ivy:resolve file="${ivy.file}" conf="default" log="download-only" />
      <ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />
      <antcall target="copy-libs" />
    </target>

Replace pattern="${build.lib.dir}/[artifact]-[revision].[ext]" with pattern="${build.lib.dir}/[artifact]-[type]-[revision].[ext]".

Step 8: Configure the URLs to crawl

In the test project, create a folder named urls, and in it create a file named seeds.txt containing the sites you want to crawl. I used http://www.163.com.

Step 9: Run org.apache.nutch.crawl.Crawler

Open the Crawler file and choose "Run As" -> "Run Configurations". On the "Arguments" tab, under "Program Arguments", enter "urls -depth 3 -topN 5" and click "Run". Ha, it fails, right? The error looks something like "Failed to set permissions of path: \tmp\Hadoop-Administrator\mapred\staging\Administrator1712398257.". This is a Hadoop problem. The fix is to edit /hadoop-1.0.2/src/core/org/apache/hadoop/fs/FileUtil.java and comment out the body of checkReturnValue. The easiest route, of course, is to find an already patched jar online and swap in its FileUtil.class.

Run it again and, ha, it succeeds. That is the end of the walkthrough.
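Going back to the build.xml change in step 7: the post does not say why [type] is added to the retrieve pattern. One plausible reading is that it keeps artifacts that share a name and revision but differ in type from ending up with the same file name. The sketch below is a simplified model of the token substitution, not Ivy's real implementation (which supports more tokens and optional parts), and the artifact names in it are invented for illustration.

```java
// Simplified model of how an Ivy retrieve pattern expands into a file name.
// Assumption: plain token replacement; artifact data below is hypothetical.
class IvyPatternDemo {
    static String expand(String pattern, String artifact, String type,
                         String revision, String ext) {
        return pattern
            .replace("[artifact]", artifact)
            .replace("[type]", type)
            .replace("[revision]", revision)
            .replace("[ext]", ext);
    }

    public static void main(String[] args) {
        String oldPattern = "[artifact]-[revision].[ext]";
        String newPattern = "[artifact]-[type]-[revision].[ext]";

        // Two artifacts with the same name and revision but different types.
        System.out.println(expand(oldPattern, "example-lib", "jar", "1.0", "jar"));
        System.out.println(expand(oldPattern, "example-lib", "source", "1.0", "jar"));
        System.out.println(expand(newPattern, "example-lib", "jar", "1.0", "jar"));
        System.out.println(expand(newPattern, "example-lib", "source", "1.0", "jar"));
    }
}
```

With the old pattern both artifacts map to example-lib-1.0.jar; with the new one they become example-lib-jar-1.0.jar and example-lib-source-1.0.jar.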
Good luck, everyone.
Problems encountered:

1. Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004

From what I found online, this can have many causes. First, set <name>plugin.folders</name><value>./src/plugin</value> in nutch-default.xml. Second, look through the hadoop.log file.
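hadoop.log can get long, so a small filter helps when hunting for the real cause behind "job failed". The sketch below prints only the lines that mention ERROR, Exception, or "Caused by". The default path logs/hadoop.log is an assumption, so pass the actual location of your log as the first argument; the class name HadoopLogScan is made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: pull the likely failure lines out of hadoop.log.
class HadoopLogScan {

    /** Returns the lines that look like errors or exception traces. */
    static List<String> suspiciousLines(Path log) throws IOException {
        List<String> hits = new ArrayList<>();
        for (String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
            if (line.contains("ERROR") || line.contains("Exception")
                    || line.contains("Caused by")) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; adjust to wherever your run writes the log.
        Path log = Paths.get(args.length > 0 ? args[0] : "logs/hadoop.log");
        for (String line : suspiciousLines(log)) {
            System.out.println(line);
        }
    }
}
```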