Notes on a simple crawler in PHP using XPath

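Simple crawler notes
In its early days a website needs to launch quickly and needs a lot of quality content, so the content has to be scraped.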
What you need to know for scraping
- PHP functions for making network requests
  - file_get_contents
  - fsockopen
  - curl
- Other
  - regular expressions / XPath
  - basic HTML
  - basic HTTP
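
Of the request functions listed above, file_get_contents is the shortest one to try. The following is only a minimal sketch: `fetch_with_stream` is a hypothetical helper name (it is not in the original function.php), and it assumes `allow_url_fopen` is enabled, plus the openssl extension for https URLs.

```php
<?php
// Hypothetical helper for illustration; not part of the original function.php.
// Returns the response body, or null when the request fails.
function fetch_with_stream(string $url, int $timeout = 30): ?string
{
    // The 'http' context options apply to both http:// and https:// URLs.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeout, // seconds
            'user_agent' => 'Mozilla/5.0 (compatible; example-spider)',
        ],
    ]);
    // @ silences the warning file_get_contents raises on failure;
    // the failure is reported through the null return value instead.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

A call would look like `$html = fetch_with_stream('https://news.ke.com/bj/baike/0033/');`. curl offers finer control (POST bodies, error codes), which is presumably why it is the recommended option here.
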
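Below is a simple PHP scraper based on regular expressions, using https://news.ke.com/bj/baike/0033/ as the example site.
I recommend curl for making network requests. The http_request function in function.php sends the request.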


    <?php
    function http_request(string $url, $data = [])
    {
        $ret = '';
        // 1. Initialize
        $ch = curl_init();
        // 2. Configure
        # URL to request
        curl_setopt($ch, CURLOPT_URL, $url);
        # Return the response as a string instead of printing it
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        # Timeout, in seconds
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        # Skip certificate verification (insecure; only for quick experiments)
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
        # Send a browser-like User-Agent
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3780.0 Safari/537.36');
        // A non-empty $data means there is a request body, so send a POST
        if (!empty($data)) {
            # Mark the request as POST
            curl_setopt($ch, CURLOPT_POST, 1);
            curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
        }
        // 3. Execute
        $ret = curl_exec($ch);
        # curl_errno() is 0 on success and greater than 0 when the request failed
        if (curl_errno($ch) > 0) {
            echo curl_error($ch);
            exit;
        }
        // 4. Release the request handle
        curl_close($ch);
        return $ret;
    }
    ?>

Create 01_spider.php


    <?php
    // __DIR__ has no trailing slash, so the path must start with "/"
    include __DIR__ . '/function.php';
    $url = 'https://news.ke.com/bj/baike/0033/';
    $html = http_request($url);
    // Target: <title>北京房产政策百科_北京房产政策知识大全【北京贝壳找房】</title>
    // i: case-insensitive, U: non-greedy, s: "." also matches newlines
    $preg = '#<title>(.*)</title>#iUs';
    preg_match_all($preg, $html, $arr);
    print_r($arr);
    /**
    Result:
    $ php spider/01_spider.php
    Array
    (
        [0] => Array
            (
                [0] => <title>北京房产政策百科_北京房产政策知识大全【北京贝壳找房】</title>
            )
    
        [1] => Array
            (
                [0] => 北京房产政策百科_北京房产政策知识大全【北京贝壳找房】
            )
    
    )
    */
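
The effect of the three modifiers is easier to see on a small inline string than on a live page. The sketch below is an illustration only: the markup is made up, and `extract_titles` is a hypothetical helper name.

```php
<?php
// Made-up markup: an upper-case tag, a title split across two lines,
// and a second title later in the document.
$html = "<html><head><TITLE>First\npage</TITLE></head>"
      . "<body><title>second</title></body></html>";

// Hypothetical helper: returns the text of every <title> element.
function extract_titles(string $html): array
{
    // i: <TITLE> matches as well as <title>
    // U: ".*" stops at the first closing tag instead of the last one
    // s: "." also matches the newline inside the first title
    preg_match_all('#<title>(.*)</title>#iUs', $html, $matches);
    return $matches[1];
}

print_r(extract_titles($html));
```

Without `U`, the single match would run from the first opening tag to the last closing tag. Without `s`, the first title would not match at all because of its line break.
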

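Next, a simple PHP scraper based on XPath. Chrome is recommended for obtaining the title's XPath (DevTools can copy the XPath of an element selected in the Elements panel).

Suppose the title we want to match is at /html/body/div[3]/div[2]/div/div[2]/div[2]/div[1]/div/a. Dropping the a tag itself, its parent div, and that div's parent leaves /html/body/div[3]/div[2]/div/div[2]/div[2], which locates the div that contains the whole list, i.e. <div class="m-col"> </div>.

That div contains many a tags, and we only want the ones that hold a title. Only the title links carry class="tit LOGCLICK", so the second part of the expression is //*[@class="tit LOGCLICK"]/text(), which selects the text of every element under the list whose class attribute is exactly "tit LOGCLICK".

Joining the two parts gives the full XPath: /html/body/div[3]/div[2]/div/div[2]/div[2]//*[@class="tit LOGCLICK"]/text()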


    <?php
    
    // __DIR__ has no trailing slash, so the path must start with "/"
    include __DIR__ . '/function.php';
    $url = 'https://news.ke.com/bj/baike/0033/';
    $html = http_request($url);
    $dom = new DOMDocument(); // create a DOM document
    libxml_use_internal_errors(true); // suppress warnings about malformed HTML
    $dom->loadHTML($html); // parse the HTML
    $dom->normalize(); // normalize the document (merges adjacent text nodes)
    $xpath = new DOMXPath($dom); // wrap the document in a DOMXPath for querying
    $query = '/html/body/div[3]/div[2]/div/div[2]/div[2]//*[@class="tit LOGCLICK"]/text()';
    $dOMNodeList = $xpath->query($query);
    foreach ($dOMNodeList as $item) {
        echo $item->nodeValue . "\n";
    }

Result (printed without a line break between titles, the output runs together):


    $ php spider/01_spider.php
    落户上学与商品房一致，共有产权房你能申请吗？购租并举下，北京租房能落户和上学吗？北京买房，你真的是首套吗？首套二套有啥区别？2018年北京住宅限购政策是什么？你的城市房租收入比是多少？北京公租房申请条件是怎么？怎么配租？北京积分落户初核结果可查，有异议可申请复核！买共有产权
    住房，能贷多少钱？共有产权房如何上市出售？购房资质审核时限缩短为1个工作日

With a line break after each title:

    $ php spider/01_title.php
    落户上学与商品房一致，共有产权房你能申请吗？
    购租并举下，北京租房能落户和上学吗？
    北京买房，你真的是首套吗？首套二套有啥区别？
    2018年北京住宅限购政策是什么？
    你的城市房租收入比是多少？
    北京公租房申请条件是怎么？怎么配租？
    北京积分落户初核结果可查，有异议可申请复核！
    买共有产权住房，能贷多少钱？
    共有产权房如何上市出售？
    购房资质审核时限缩短为1个工作日
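
One thing to keep in mind about `@class="tit LOGCLICK"`: it compares the whole attribute string, so it stops matching if the site reorders the classes or adds another one. The sketch below compares it with a token-based match on a made-up fragment. The markup and the `query_texts` helper are assumptions for illustration, not taken from the real page.

```php
<?php
// Made-up markup that mimics the list described above.
$html = <<<'HTML'
<html><body>
<div class="m-col">
  <div><a class="tit LOGCLICK" href="/a/1">Title one</a><a class="more" href="/a/1">more</a></div>
  <div><a class="LOGCLICK  tit" href="/a/2">Title two</a></div>
</div>
</body></html>
HTML;

// Hypothetical helper: runs an XPath query and returns the matched text values.
function query_texts(string $html, string $query): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Exact comparison of the whole class attribute: only the first link matches.
$exact = query_texts($html, '//div[@class="m-col"]//a[@class="tit LOGCLICK"]/text()');

// Token match: any link whose class list contains "tit", in any order.
$token = query_texts(
    $html,
    '//div[@class="m-col"]//a[contains(concat(" ", normalize-space(@class), " "), " tit ")]/text()'
);

print_r($exact);
print_r($token);
```

Anchoring on `//div[@class="m-col"]` instead of the absolute `/html/body/div[3]/...` path is a similar trade-off: it is less sensitive to layout changes elsewhere on the page, but it assumes that class name stays stable.
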

The article cover images can be fetched the same way:


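    <?php
    // Continues from the script above: $xpath is the DOMXPath object built there.
    // @data-original selects the attribute that holds the image URL.
    $query = '/html/body/div[3]/div[2]/div/div[2]/div[2]//img/@data-original';
    $dOMNodeList = $xpath->query($query);
    foreach ($dOMNodeList as $item) {
        echo $item->nodeValue . "\n";
    }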

Result:


    $ php spider/01_spider.php
    https://img.yuanmabao.com/zijie/pic/2022/09/11/ii52502o5bh.jpeg.230x175.jpg
    https://img.yuanmabao.com/zijie/pic/2022/09/11/amfs15jswei.jpeg.230x175.jpg
    https://img.yuanmabao.com/zijie/pic/2022/09/11/unn4od3fh0y.jpeg.230x175.jpg
    https://img.yuanmabao.com/zijie/pic/2022/09/11/1m3znsdfjwd.jpeg.230x175.jpg
    https://img.yuanmabao.com/zijie/pic/2022/09/11/i2agapbvjon.jpeg.230x175.jpg
    https://img.yuanmabao.com/zijie/pic/2022/09/11/hiwdh4bh4rl.jpeg.230x175.jpg
    https://img.yuanmabao.com/zijie/pic/2022/09/11/jf4e4o4i0ti.jpeg.230x175.jpg
    https://img.yuanmabao.com/zijie/pic/2022/09/11/4y40otleaqh.jpeg.230x175.jpg
    https://img.yuanmabao.com/zijie/pic/2022/09/11/2jbcmgzzqgr.jpeg.230x175.jpg
    https://img.yuanmabao.com/zijie/pic/2022/09/11/o4ez2gkbkn1.jpeg.230x175.jpg
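
Querying titles and images separately gives two independent lists, which can drift out of step if one article has no cover image. One possible way around this is to select each list item first and then run relative queries inside it. The sketch below assumes each item is wrapped in a `<div class="item">`. That class name, the sample markup, and the `extract_items` helper are all made up for illustration.

```php
<?php
// Made-up markup: the second item deliberately has no image.
$html = <<<'HTML'
<html><body>
<div class="m-col">
  <div class="item">
    <img data-original="https://example.com/1.jpg" src="placeholder.gif">
    <a class="tit LOGCLICK" href="/a/1">Title one</a>
  </div>
  <div class="item">
    <a class="tit LOGCLICK" href="/a/2">Title two</a>
  </div>
</div>
</body></html>
HTML;

// Hypothetical helper: returns one ['title' => ..., 'image' => ...] row per item.
function extract_items(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $items = [];
    foreach ($xpath->query('//div[@class="m-col"]/div[@class="item"]') as $item) {
        // The leading "." makes the query relative to the current item.
        // string() yields '' when nothing matches.
        $title = $xpath->evaluate('string(.//a[@class="tit LOGCLICK"])', $item);
        $image = $xpath->evaluate('string(.//img/@data-original)', $item);
        $items[] = [
            'title' => trim($title),
            'image' => $image !== '' ? $image : null,
        ];
    }
    return $items;
}

print_r(extract_items($html));
```

Each row keeps its own title and image together, and a missing image shows up as `null` rather than shifting the later entries.
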

For more details, see the documentation.
