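Background: having previously studied the BeautifulSoup module (see Further reading at the end), this post builds on it: given the URL of a university ranking page, the program prints the ranking information to the screen.
Directed crawler: only the input URL is fetched; links on that page are not followed.
Technical approach: requests + bs4
Page crawled: https://www.shanghairanking.cn/rankings/bcur/2020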
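1 Program structure
- Fetch the content of the university ranking page from the network: getHTMLText()
- Extract the information in the page into a suitable data structure: fillUnivList()
- Use that data structure to format and print the result: printUnivList()
2 Code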
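The "suitable data structure" in the second step is a list of [rank, name, total score] rows. The sketch below shows that extraction step in isolation, so it can be run without network access. It is not the article's code: it swaps in the standard library's html.parser in place of bs4, the names RowCollector and fill_univ_list are hypothetical, and the table content is made-up placeholder data, not real ranking results. The column positions (0, 1, 4) are assumed to match the layout used by the main program.

```python
from html.parser import HTMLParser

# Placeholder table with the assumed column order:
# rank, name, province, type, total score. Not real ranking data.
SAMPLE_HTML = """
<table><tbody>
<tr><td> 1 </td><td>University A</td><td>P1</td><td>T1</td><td>90.0</td></tr>
<tr><td> 2 </td><td>University B</td><td>P2</td><td>T2</td><td>80.0</td></tr>
</tbody></table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by the enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell strings
        self._row = None     # cells of the row currently being read
        self._in_td = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_td and self._row:
            self._row[-1] += data


def fill_univ_list(ulist, html):
    """Append [rank, name, total score] for each complete table row."""
    parser = RowCollector()
    parser.feed(html)
    for cells in parser.rows:
        if len(cells) >= 5:  # skip rows that lack the expected columns
            ulist.append([cells[0].strip(), cells[1].strip(), cells[4].strip()])


uinfo = []
fill_univ_list(uinfo, SAMPLE_HTML)
print(uinfo)
```

Keeping the rows as plain lists means the printing step only needs to index u[0], u[1] and u[2], and never touches the HTML.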
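import bs4
import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Fetch the page and return its text, or an empty string on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding  # guess the encoding from the content
        return r.text
    except Exception as exc:
        print('There was a problem: %s' % (exc))
        return ""


def fillUnivList(ulist, html):
    """Append [rank, name, total score] for each table row to ulist."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # nothing to parse, e.g. the fetch failed
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip text nodes between rows
            tds = tr('td')  # shorthand for tr.find_all('td')
            ulist.append([tds[0].text.strip(), tds[1].text.strip(), tds[4].text.strip()])


def printUnivList(ulist, num):
    """Print the first num rows as centered, tab-separated columns."""
    # {3} is the fill character: chr(12288), the full-width space,
    # keeps columns that contain Chinese text aligned.
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    # Header: rank, university name, total score
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for u in ulist[:num]:  # slicing avoids an IndexError on short lists
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = 'https://www.shanghairanking.cn/rankings/bcur/2020'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 10)  # print the top 10 universities


if __name__ == "__main__":
    main()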
Output (screenshot not preserved):
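The format template passes chr(12288) as the fill character for the name column. The short sketch below illustrates why: padding Chinese text with ordinary ASCII spaces gives columns of uneven display width, while padding with the full-width space (U+3000) keeps them equal. The helper display_width is a hypothetical approximation based on unicodedata.east_asian_width, and the three names are just sample strings; actual alignment also depends on the terminal font.

```python
import unicodedata


def display_width(s):
    """Approximate terminal width: wide and full-width characters count as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)


FULLWIDTH_SPACE = chr(12288)  # U+3000 IDEOGRAPHIC SPACE

names = ["清华大学", "北京大学", "中国科学技术大学"]  # sample strings only

# Default fill: an ASCII space, one column wide.
ascii_padded = ["{0:^10}".format(n) for n in names]
# Nested field {1} supplies the fill character, as {3} does in the template.
wide_padded = ["{0:{1}^10}".format(n, FULLWIDTH_SPACE) for n in names]

print([display_width(s) for s in ascii_padded])  # [14, 14, 18]
print([display_width(s) for s in wide_padded])   # [20, 20, 20]
```

With ASCII padding, a name of 4 characters and a name of 8 characters occupy different widths even though both fields are 10 characters long; with the full-width fill, every character in the field is two columns wide, so the fields line up.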
References:
[1] China University MOOC: Python Web Crawling and Information Extraction (https://www.icourse163.org/course/BIT-1001870001)
[2] Python format string-formatting function (https://www.runoob.com/python/att-string-format.html)
[3] Misaligned string formatting and chr(12288) (https://blog.csdn.net/Heart_for_Ling/article/details/109247500)
[4] .string returning None in BeautifulSoup, and using .text (https://blog.csdn.net/lin252931/article/details/105403723)
Further reading:
[1] Python: Getting started with the BeautifulSoup library