Implementing a web crawler on Spark


Web crawlers are an important means of collecting large amounts of data from the Internet. Crawling itself is a very mature technique, but I wanted to see how it behaves in a Spark environment.

It turns out to be quite simple: build a JavaSparkContext, and the usual Java approach to fetching web pages carries over directly.

First, provide a few seed URLs and parallelize them into a JavaRDD (note that parallelize takes a java.util.List, so urlList below is a List<String> of seed URLs rather than a single string):

    JavaRDD<String> rdd = sc.parallelize(urlList);
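For context, a minimal sketch of the driver setup this assumes; the class name, application name, master URL, and seed URLs below are placeholders, not values from the original post:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkCrawler {
        public static void main(String[] args) {
            // Local driver setup; on a real cluster the master would normally come from spark-submit
            SparkConf conf = new SparkConf().setAppName("spark-crawler").setMaster("local[4]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Initial crawl frontier: a handful of seed URLs (placeholder values)
            List<String> urlList = Arrays.asList(
                    "http://docs.opencv.org/",
                    "http://example.com/");
            JavaRDD<String> rdd = sc.parallelize(urlList);

            // ... map each URL to its HTML as shown below, then trigger an action ...

            sc.close();
        }
    }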

    // Imports needed by the fetch function (Apache HttpClient 4.x and Jsoup)
    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    JavaRDD<String> content = rdd.map(new Function<String, String>() {
        public String call(String url) throws Exception {
            System.out.println(url);
            CloseableHttpClient client = null;
            CloseableHttpResponse response = null;
            try {
                // Create a default HTTP client and issue a GET for the URL
                client = HttpClients.createDefault();
                HttpGet get = new HttpGet(url);
                response = client.execute(get);
                HttpEntity entity = response.getEntity();
                // Read the response body into a byte stream
                ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
                entity.writeTo(byteArrayOutputStream);
                // Decode to a string and parse it into a Jsoup Document
                // (the Document can later be used to extract child-page links)
                String html = new String(byteArrayOutputStream.toByteArray(), StandardCharsets.UTF_8);
                Document document = Jsoup.parse(html);
                return html;
            } catch (Exception ex) {
                ex.printStackTrace();
                return "";
            } finally {
                if (response != null) {
                    response.close();
                }
                if (client != null) {
                    client.close();
                }
            }
        }
    });
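Note that map alone is lazy; nothing is fetched until an action runs. A small, hedged example of triggering the crawl (the output path is only a placeholder):

    // Persist the raw HTML of every fetched page; the path is just an example
    content.saveAsTextFile("hdfs:///tmp/crawl-output");

    // Or pull back a couple of pages to the driver for a quick look
    for (String html : content.take(2)) {
        System.out.println(html.substring(0, Math.min(200, html.length())));
    }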

Of course, you can also extract child-page links from the fetched HTML and keep crawling, either depth-first or breadth-first; a breadth-first sketch follows.
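A rough sketch of the breadth-first variant, under a few assumptions: Spark 2.x+ (where FlatMapFunction.call returns an Iterator), a hypothetical helper fetchHtml(url) that wraps the HttpClient code above, and an arbitrary crawl depth of 2:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Each round maps the current frontier to the set of links found on those pages
    JavaRDD<String> frontier = rdd;  // the seed RDD from above
    for (int depth = 0; depth < 2; depth++) {
        frontier = frontier.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String url) throws Exception {
                List<String> links = new ArrayList<String>();
                try {
                    // fetchHtml is a hypothetical helper wrapping the HttpClient code above
                    String html = fetchHtml(url);
                    // Parse with the page URL as base URI so relative links can be resolved
                    Document doc = Jsoup.parse(html, url);
                    for (Element a : doc.select("a[href]")) {
                        links.add(a.attr("abs:href"));
                    }
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
                return links.iterator();
            }
        }).distinct();  // drop duplicate URLs before the next round
    }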

For example, crawling http://docs.opencv.org/ prints the URL and returns that page's HTML document as the output.
