1. System preparation: Install Ubuntu 13.10, configure the apt sources, then run sudo apt-get update and sudo apt-get upgrade.
2. Software preparation
(1) Install Ant: sudo apt-get install ant1.7, then check the installation with ant -version; output such as
Apache Ant version 1.7.1 compiled on September 3 2011
indicates that the installation succeeded.
(2) JDK installation and configuration: Download the JDK from the official site and extract it to /opt/jdk.
Configure the environment variables: sudo gedit /etc/profile and append the following at the end of the file:
export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
Save and exit, then run source /etc/profile to make the settings take effect.
Verify: both java -version and java produce output (output omitted here).
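As an extra sanity check (a minimal sketch, assuming the /opt/jdk layout above), the shell should now resolve java from the new JDK:

which java        # expected: /opt/jdk/bin/java
echo $JAVA_HOME   # expected: /opt/jdk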
(3) Nutch: Download Nutch 1.7 and extract it to /opt/nutch.
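For reference, a sketch of the download and extraction (the mirror URL and tarball name are assumptions; use whatever mirror is current):

cd /opt
sudo wget http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-bin.tar.gz
sudo tar -xzf apache-nutch-1.7-bin.tar.gz
sudo mv apache-nutch-1.7 nutch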
cd /opt/nutch
bin/nutch
This should now print the usage help, indicating the installation succeeded. The related configuration follows.
step1: Edit conf/nutch-site.xml to set the agent name used in HTTP requests:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Friendly Crawler</value>
  </property>
</configuration>
step2: Create the seed directory: mkdir -p urls
step3: Write the seed URL into urls/seed.txt (sudo gedit urls/seed.txt), e.g.:
http://www.linuxidc.com
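Equivalently, the seed file can be written from the shell (one URL per line):

echo "http://www.linuxidc.com" > urls/seed.txt
cat urls/seed.txt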
step4: Configure conf/regex-urlfilter.txt, commenting out the catch-all rule at the end and adding the target domain:

# accept anything else
# .

# added by yoyo
36kr.com
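Note that entries in regex-urlfilter.txt are regular expressions prefixed with + (accept) or - (reject), so a bare domain line like the one above may not behave as intended. A hedged sketch of the more usual form of this edit (the exact pattern is an assumption, not taken verbatim from the original file):

# accept anything else
#+.

# added by yoyo: only follow URLs under 36kr.com (example pattern)
+^http://([a-z0-9]*\.)*36kr\.com/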
step5: Edit conf/nutch-site.xml again and add a parser.skip.truncated property:

<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
</property>
This is needed because packet captures with tcpdump or wireshark show that this site's page content comes back truncated (delivered in segments), and Nutch's default settings do not parse truncated content, so parsing of it has to be enabled here. See: http://lucene.472066.n3.nabble.com/Content-Truncation-in-Nutch-2-1-MySQL-td4038888.html
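As an aside (not from the original article): if pages are instead being cut off by Nutch's own download size cap rather than by the server, the standard http.content.limit property in conf/nutch-site.xml can also be raised, for example:

<property>
  <name>http.content.limit</name>
  <!-- -1 removes the limit; the default is 65536 bytes -->
  <value>-1</value>
</property>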
step6: Run a test crawl:
bin/nutch crawl urls -dir crawl
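For reference, the one-step crawl command also accepts -depth (number of fetch rounds starting from the seeds) and -topN (maximum URLs fetched per round), which are used later in this article, e.g.:

bin/nutch crawl urls -dir crawl -depth 2 -topN 5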
(4) Solr installation: Download Solr 4.6 and extract it to /opt/solr.
cd /opt/solr/example
java -jar start.jar
If the page http://localhost:8983/solr/ opens normally, Solr is running correctly.
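Without a browser, roughly the same check can be done from the shell (a sketch; assumes the default port 8983):

curl -s "http://localhost:8983/solr/admin/cores?action=STATUS" | head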
3. Integrating Nutch with Solr
(1) Environment variables: sudo gedit /etc/profile and add:
export NUTCH_RUNTIME_HOME=/opt/nutch
export APACHE_SOLR_HOME=/opt/solr
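Then reload the profile and confirm both variables are set (expected values per the paths above):

source /etc/profile
echo $NUTCH_RUNTIME_HOME   # /opt/nutch
echo $APACHE_SOLR_HOME     # /opt/solr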
(2) Integration:

mkdir ${APACHE_SOLR_HOME}/example/solr/conf
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
Restart Solr:
java -jar start.jar
Build the index:
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
This failed with an error:
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : use authentication (default false)
    solr.auth : username for authentication
    solr.auth.password : password for authentication
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
The fix follows http://stackoverflow.com/questions/13429481/error-while-indexing-in-solr-data-crawled-by-nutch
Several other fields need to be added in a similar way. Edit the schema.xml of the Solr example collection (in this setup ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml) and add the following fields inside the <fields>…</fields> section:

<field name="host" type="string" stored="false" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
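After editing schema.xml, restart Solr (stop the running java -jar start.jar and start it again) so that the new fields are picked up before re-running the crawl.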
(3) Verification. First remove the old crawl data:
rm crawl/ -Rf
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
…………
…………
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-03-03 08:55:30, elapsed: 00:00:01
LinkDb: starting at 2014-03-03 08:55:30
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/opt/nutch/crawl/segments/20140303085430
LinkDb: adding segment: file:/opt/nutch/crawl/segments/20140303085441
LinkDb: finished at 2014-03-03 08:55:31, elapsed: 00:00:01
Indexer: starting at 2014-03-03 08:55:31
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : use authentication (default false)
    solr.auth : username for authentication
    solr.auth.password : password for authentication
Indexer: finished at 2014-03-03 08:55:35, elapsed: 00:00:03
SolrDeleteDuplicates: starting at 2014-03-03 08:55:35
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
SolrDeleteDuplicates: finished at 2014-03-03 08:55:36, elapsed: 00:00:01
crawl finished: crawl

To search the crawled content, open http://localhost:8983/solr/#/collection1/query in a browser and click Execute Query.
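The same query can also be issued from the shell (a sketch; assumes the default collection1 core and Solr still running on port 8983):

curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&rows=5"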