Question
(Python, using lxml XPath) I need to extract everything inside a div in an HTML document, including the tags.
<div>
<table>
<tr>
<td class="td class">Row value 1</td>
<td class="td class">Row value 2</td>
</tr>
<tr>
<td class="td class">Row value 3</td>
<td class="second td class">Row value 4</td>
</tr>
<tr>
<td class="third td class">Row value 1</td>
<td class="td class">Row value 1</td>
</tr>
</table>
</div>
How can I extract the table element, tags included, so that the result looks like this:
<table>
<tr>
<td class="td class">Row value 1</td>
<td class="td class">Row value 2</td>
</tr>
<tr>
<td class="td class">Row value 3</td>
<td class="second td class">Row value 4</td>
</tr>
<tr>
<td class="third td class">Row value 1</td>
<td class="td class">Row value 1</td>
</tr>
</table>
Solutions
1
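from lxml import etree
# html holds the markup from the question as a string
div = etree.HTML(html)  # returns the root element of the parsed document
table = div.xpath('//div/table')[0]  # first <table> that is a direct child of a <div>
# Serialize the element, tags included (bytes by default; pass encoding='unicode' for str)
content = etree.tostring(table, pretty_print=True, method='html')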
2
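from lxml import etree
from lxml.html import fromstring, tostring
# fromstring returns an HtmlElement object and can be used in place of etree.HTML:
# selector = fromstring(html)
selector = etree.HTML(html)
content = selector.xpath('//div/table')[0]
print(content)  # prints the Element object, not its markup
# tostring returns the element's original HTML markup (as bytes)
original_html = tostring(content)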
3
Use BeautifulSoup's find method.
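The third option is only named in the answer, so here is a minimal sketch of what it might look like. It assumes the bs4 package is installed and uses Python's built-in html.parser backend; the variable names and the shortened sample markup are illustrative, not taken from the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup; `sample` is an illustrative name.
sample = """<div>
<table>
<tr>
<td class="td class">Row value 1</td>
<td class="td class">Row value 2</td>
</tr>
</table>
</div>"""

soup = BeautifulSoup(sample, "html.parser")  # built-in parser, no extra dependency
table = soup.find("div").find("table")  # first <table> inside the first <div>
table_html = str(table)  # str() on a Tag gives its markup, tags included
print(table_html)
```

find returns None when nothing matches, so real code would probably check the result before calling str() on it.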