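1. Introduction
Heritrix 3.x differs substantially from Heritrix 1.x: it introduces a completely new configuration model and changes the extension interfaces. Combined with the lack of documentation, this has left many Heritrix developers confused. Earlier articles covered configuring, deploying, and running Heritrix; this article walks through an example of writing a custom Extractor for Heritrix 3.x.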
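2. Configuration
The Heritrix 3.x WebUI has changed. It no longer offers the selection-based interface of 1.x; instead, the configuration file is edited directly online. For a custom Extractor to take part in a crawl, the configuration file must first be modified so that the Extractor is added to Heritrix's Processor chain. The relevant part of the configuration file is shown below (the bodies of the stock beans are omitted).
2.1 Configuration file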
<!-- FETCH CHAIN -->
<!-- processors declared as named beans -->
<bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
</bean>
<bean id="preconditions" class="org.archive.crawler.prefetch.PreconditionEnforcer">
</bean>
<bean id="fetchDns" class="org.archive.modules.fetcher.FetchDNS">
</bean>
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
</bean>
<bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP">
</bean>
<!-- custom Extractor: begin -->
<bean id="SohuNewsExtractor" class="my.SohuNewsExtractor">
</bean>
<!-- custom Extractor: end -->
<bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML">
</bean>
<bean id="extractorCss" class="org.archive.modules.extractor.ExtractorCSS">
</bean>
<bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS">
</bean>
<bean id="extractorSwf" class="org.archive.modules.extractor.ExtractorSWF">
</bean>
<!-- assembled into ordered FetchChain bean -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <!-- recheck scope, if so enabled... -->
      <ref bean="preselector"/>
      <!-- ...then verify or trigger prerequisite URIs fetched, allow crawling... -->
      <ref bean="preconditions"/>
      <!-- ...fetch if DNS URI... -->
      <ref bean="fetchDns"/>
      <!-- ...fetch if HTTP URI... -->
      <ref bean="fetchHttp"/>
      <!-- ...extract outlinks from HTTP headers... -->
      <ref bean="extractorHttp"/>
      <!-- custom Extractor: begin -->
      <!-- ...extract outlinks from HTTP content... -->
      <ref bean="SohuNewsExtractor"/>
      <!-- custom Extractor: end -->
      <!-- ...extract outlinks from HTML content... -->
      <ref bean="extractorHtml"/>
      <!-- ...extract outlinks from CSS content... -->
      <ref bean="extractorCss"/>
      <!-- ...extract outlinks from Javascript content... -->
      <ref bean="extractorJs"/>
      <!-- ...extract outlinks from Flash content... -->
      <ref bean="extractorSwf"/>
    </list>
  </property>
</bean>
2.2 Adding the bean and registering it in the processor list
<bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP">
</bean>
<!-- custom Extractor: begin -->
<bean id="SohuNewsExtractor" class="my.SohuNewsExtractor">
</bean>
<!-- custom Extractor: end -->
...
      <!-- custom Extractor: begin -->
      <!-- ...extract outlinks from HTTP content... -->
      <ref bean="SohuNewsExtractor"/>
      <!-- custom Extractor: end -->
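With the two additions above in place (the bean declaration and its entry in the processors list), the custom Extractor is scheduled as part of the Processor chain.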
3. Code notes
3.1 Extractor base class
The Extractor base class has changed. Subclasses now have to override an additional method:
@Override
protected boolean shouldProcess(CrawlURI uri) {
    // TODO Auto-generated method stub
    return false;
}
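This method must be overridden so that it returns true for the URIs the Extractor should handle. If it is left as the generated stub, which returns false, the custom Extractor's void extract(CrawlURI uri) is never called.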
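The gating described above can be illustrated with a minimal, self-contained sketch. It does not use the Heritrix classes: CrawlURI and Extractor here are simplified stand-ins written only for this example, and the URL pattern for Sohu news pages is a hypothetical assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ShouldProcessSketch {

    /** Simplified stand-in for Heritrix's CrawlURI; it only carries the URI string. */
    static class CrawlURI {
        private final String uri;
        CrawlURI(String uri) { this.uri = uri; }
        String getURI() { return uri; }
    }

    /** Simplified stand-in for the base class: extract() runs only when shouldProcess() returns true. */
    static abstract class Extractor {
        final void process(CrawlURI uri) {
            if (shouldProcess(uri)) {
                extract(uri);
            }
        }
        protected abstract boolean shouldProcess(CrawlURI uri);
        protected abstract void extract(CrawlURI uri);
    }

    static class SohuNewsExtractor extends Extractor {
        // Hypothetical pattern for the pages of interest; adjust to the real site layout.
        private static final Pattern NEWS =
                Pattern.compile("^https?://news\\.sohu\\.com/.*\\.shtml$");

        // Records which URIs actually reached extract().
        final List<String> seen = new ArrayList<>();

        @Override
        protected boolean shouldProcess(CrawlURI uri) {
            // Returning false here (as the generated stub does) would skip extract() entirely.
            return NEWS.matcher(uri.getURI()).matches();
        }

        @Override
        protected void extract(CrawlURI uri) {
            seen.add(uri.getURI());
        }
    }

    public static void main(String[] args) {
        SohuNewsExtractor extractor = new SohuNewsExtractor();
        extractor.process(new CrawlURI("http://news.sohu.com/20120101/n123.shtml"));
        extractor.process(new CrawlURI("http://example.com/index.html"));
        // Only the first URI matches the pattern, so only it reaches extract().
        System.out.println(extractor.seen);
    }
}
```

In a real Heritrix 3.x extractor the same decision would live in the overridden shouldProcess; the stand-in classes exist only so the sketch runs without the Heritrix jars.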
3.2 Constructor
In 1.x the constructor looked like this:
public Extractor(String name, String description) {
    super(name, description);
    // TODO Auto-generated constructor stub
}
In 3.x the constructor parameters have been dropped, and the default (no-argument) constructor is used instead.
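A no-argument constructor fits the bean-style configuration from section 2: a bean declared with only an id and a class is created through its default constructor and configured afterwards through setters. The sketch below imitates that step with plain reflection. It is neither Spring nor Heritrix code, and MyExtractor, its name property, and instantiate are hypothetical names used only for illustration.

```java
public class NoArgBeanSketch {

    /** Hypothetical extractor with a no-argument constructor and a setter-configured property. */
    public static class MyExtractor {
        private String name = "unnamed";

        public MyExtractor() {
            // No name/description parameters, unlike the 1.x constructor.
        }

        public void setName(String name) { this.name = name; }
        public String getName() { return name; }
    }

    /** Roughly what a bean container does for a bean with only id and class: call the no-arg constructor. */
    static Object instantiate(String className) throws ReflectiveOperationException {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        MyExtractor extractor = (MyExtractor) instantiate(MyExtractor.class.getName());
        // Properties are set after construction, as a <property> element would do.
        extractor.setName("SohuNewsExtractor");
        System.out.println(extractor.getName());
    }
}
```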
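4. Open questions
protected void extract(CrawlURI curi)
{
    // 1. What processing should be done here?
    // 2. How can subsequent download behavior be controlled so that only the desired content is fetched?
}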
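One possible direction for both questions, offered only as a sketch and not as the Heritrix way of doing it: inside extract, pull candidate links out of the page content and keep only those matching a pattern for the wanted pages. The code below works on a plain HTML string and does not touch the Heritrix API; extractWanted, the href regex, and the Sohu URL pattern are all assumptions made for this example, and a regex is only a rough substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractFilterSketch {

    // Naive href matcher; good enough for a sketch, not for arbitrary HTML.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the links found in html that fully match the wanted pattern, in document order. */
    static List<String> extractWanted(CharSequence html, Pattern wanted) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1);
            if (wanted.matcher(link).matches()) {
                out.add(link);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://news.sohu.com/a.shtml\">news</a>"
                + "<a href='http://example.com/'>other</a>";
        // Hypothetical pattern for the pages that should be followed.
        Pattern wanted = Pattern.compile("^https?://news\\.sohu\\.com/.*\\.shtml$");
        // Only the news.sohu.com link passes the filter.
        System.out.println(extractWanted(html, wanted));
    }
}
```

How the surviving links would be handed back to the crawler, and whether filtering belongs in the Extractor or in the crawl scope, remain open as stated above.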