How to extract HTML tags with XPath (tags and their content)

2021-06-15 15:05:55

Problem

(Python, using lxml and XPath) I need to extract everything inside a div in an HTML document, tags included.

<div>
   <table>
      <tr>
         <td class="td class">Row value 1</td>
         <td class="td class">Row value 2</td>
      </tr>
      <tr>
         <td class="td class">Row value 3</td>
         <td class="second td class">Row value 4</td>
      </tr>
      <tr>
         <td class="third td class">Row value 1</td>
         <td class="td class">Row value 1</td>
      </tr>
   </table>
</div>

How can the table tag be extracted so that the result looks like this:

<table>
  <tr>
     <td class="td class">Row value 1</td>
     <td class="td class">Row value 2</td>
  </tr>
  <tr>
     <td class="td class">Row value 3</td>
     <td class="second td class">Row value 4</td>
  </tr>
  <tr>
     <td class="third td class">Row value 1</td>
     <td class="td class">Row value 1</td>
  </tr>
</table>

Solution

1

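from lxml import etree

# html is the HTML string shown in the problem above
root = etree.HTML(html)
table = root.xpath('//div/table')[0]
# Serialize the element, tags included; tostring returns bytes by default
content = etree.tostring(table, pretty_print=True, method='html')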

2

from lxml.html import fromstring, tostring

# fromstring returns an HtmlElement object
selector = fromstring(html)
content = selector.xpath('//div/table')[0]
print(content)

# tostring returns the element's original HTML markup, tags included
original_html = tostring(content)

3

Use BeautifulSoup's find method.
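
The original post gives no code for this option, so the following is only a rough sketch of how the find approach might look. It assumes the bs4 package is installed, uses a shortened version of the markup from the question, and picks Python's built-in html.parser as the parser. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question
html = '<div><table><tr><td class="td class">Row value 1</td></tr></table></div>'

# html.parser ships with Python, so no extra parser dependency is needed
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None when nothing matches
table = soup.find('div').find('table')

# str() on a Tag yields its markup, tags included
table_html = str(table)
print(table_html)
```

Because find returns None when there is no match, real code would likely check the result of each call before chaining the next one.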
