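Hi everyone, I'm 不温卜火, a third-year undergraduate majoring in big data at a school of computer science. My nickname comes from the idiom 不温不火 ("neither hot nor cold") and reflects my wish to stay even-tempered. As a newcomer to the internet industry, I blog partly to record what I learn and partly to summarize my own mistakes, in the hope of helping other beginners who are at the same stage. My knowledge is limited, so the posts will inevitably contain errors; corrections are very welcome. For now I only publish on CSDN. Blog home page: https://buwenbuhuo.blog.csdn.net/.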
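PS: More and more people are scraping my articles without my consent, so I state explicitly: reposting without my permission is prohibited.
The previous two posts covered crawlers for pages obfuscated with JS and with CSS. This post continues with a page that hides text behind a Base64-encoded custom font, using Anjuke (安居客) as the example. Before we start, we need to understand Base64 and how it works. (Strictly speaking, Base64 is an encoding rather than encryption: anyone can reverse it without a key.)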
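1. Basic principles of Base64
1.1 Base64 encoding
- Base64 processes the input in groups of three 8-bit bytes.
- For each group, take the byte value of each character (for plain ASCII text, its ASCII code),
- then write each value as 8 binary digits, which gives 3 × 8 = 24 bits.
- Split these 24 bits into four 6-bit groups and prefix each group with two zero bits, which gives four 8-bit values, each in the range 0 to 63.
- Convert each of the four values to decimal and look it up in the Base64 alphabet (A-Z, a-z, 0-9, +, /) to get the output character. (Notes: 1. Base64 works on 8-bit bytes. An implementation that maps each character directly to one byte can only handle code points \u0000-\u00ff; other text, such as Chinese, must first be converted to bytes, for example as UTF-8. 2. If the input length is not a multiple of 3, the last group is padded with zero bits and each missing 6-bit group is written as =.)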
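This part is excerpted from 叶落为重生's article 《关于base64编码的原理及实现》 ("On the principle and implementation of Base64 encoding"); have a look at it if you are interested.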
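The steps above can be sketched in a few lines of Python. This is only an illustration of the bit shuffling, not how anyone should encode in practice (the standard library already does it); the function name `b64encode_manual` is made up for this example.

```python
import base64

# The standard Base64 alphabet: index 0-63 -> output character.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64encode_manual(data: bytes) -> str:
    """Encode bytes as Base64 by following the 3-byte / 4-character steps."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1 or 2 bytes short in the last group
        # Pad a short final group with zero bytes, then view it as 24 bits.
        bits = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Cut the 24 bits into four 6-bit values (each 0-63).
        indices = [(bits >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        chars = [ALPHABET[j] for j in indices]
        # Each missing input byte turns one trailing character into '='.
        if missing:
            chars[-missing:] = ["="] * missing
        out.extend(chars)
    return "".join(out)


print(b64encode_manual(b"Man"))  # TWFu
print(b64encode_manual(b"Ma"))   # TWE=
# Cross-check against the standard library.
print(b64encode_manual(b"anjuke") == base64.b64encode(b"anjuke").decode("ascii"))  # True
```

Note that the function takes bytes, not a string, which is why Chinese text works as long as it is first encoded (for example with UTF-8).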
1.2 An online tool for testing Base64
Link: http://tool.chinaz.com/Tools/Base64.aspx
Open the page, enter some text, and it shows the encoded or decoded result.
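The same check can be done locally with Python's built-in `base64` module, without an online tool. A minimal round trip, with a sample string chosen only for illustration:

```python
import base64

text = "安居客 Base64"

# Text must become bytes first (UTF-8 here); Base64 then turns bytes into ASCII.
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
# Decoding reverses both steps.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded == text)             # True
print(base64.b64decode("TWE="))    # b'Ma'
```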
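2. Page analysis and font download
Anjuke site: https://bj.zu.anjuke.com/
First, look at the response body returned for the current request.
Scrolling further down,
we find that part of the text is obfuscated, which suggests a CSS custom-font trick, so let's start by looking at the font it uses.
Look in the page's style for where this font comes from (click the … at the upper left).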
When we scraped Dianping (大众点评) last time, we already saw what a custom font declaration looks like:
@font-face {
    font-family: "PingFangSC-Regular-address";
    src: url("//s3plus.meituan.net/v1/mss_73a511b8f91f43d0bdae92584ea6330b/font/5a43c7ad.eot");
    src: url("//s3plus.meituan.net/v1/mss_73a511b8f91f43d0bdae92584ea6330b/font/5a43c7ad.eot?#iefix") format("embedded-opentype"),
         url("//s3plus.meituan.net/v1/mss_73a511b8f91f43d0bdae92584ea6330b/font/5a43c7ad.woff");
}
.address {
    font-family: 'PingFangSC-Regular-address';
}
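There, the font is referenced as src:url("address of the font"). Base64 makes it possible to embed the data inline instead: the rule uses "data:<encoded data>" in place of an address. Inspecting the style on this page shows "data:application/font-ttf;charset=utf-8;base64,<Base64-encoded font data>", so the data can be located with a regular expression.
Let's copy this part out first.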
@font-face{font-family:'fangchan-secret';src:url('data:application/font-ttf;charset=utf-8;base64,AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8R/YwAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQa9/F7... [remainder of the Base64 payload omitted] ...') format('truetype')}.strongbox{font-family:'fangchan-secret','Hiragino Sans GB','Microsoft yahei',Arial,sans-serif,'宋体'!important}
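Before touching the network, the extraction idea can be tried offline. The sketch below runs a regular expression over a stand-in `@font-face` rule; the payload is a made-up placeholder rather than a real font, and `extract_font_bytes` is a hypothetical helper name used only here.

```python
import base64
import re

# Capture everything between "base64," and the closing "')" of the url(...).
FONT_RE = re.compile(r"base64,(.*?)'\)", re.S)


def extract_font_bytes(html: str) -> bytes:
    """Return the decoded bytes of the first inline Base64 font in html."""
    match = FONT_RE.search(html)
    if match is None:
        raise ValueError("no inline Base64 font found")
    return base64.b64decode(match.group(1))


# Placeholder payload: NOT a real font, just bytes that start like a TrueType file.
fake_payload = base64.b64encode(b"\x00\x01\x00\x00fake-font").decode("ascii")
sample_css = (
    "@font-face{font-family:'fangchan-secret';"
    "src:url('data:application/font-ttf;charset=utf-8;base64,"
    + fake_payload
    + "') format('truetype')}"
)

print(extract_font_bytes(sample_css)[:4])  # b'\x00\x01\x00\x00'
```

The closing parenthesis in the pattern has to be escaped (`\)`); an unescaped one makes the pattern invalid.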
Next, send a request, fetch the page, and extract the Base64 data.
import requests

url = "https://bj.zu.anjuke.com/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
}
response = requests.get(url, headers=headers)
html = response.content.decode("utf-8")
print(html)
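The font in the style turns out to be written by JS, but that does not affect extraction with a regular expression. After extracting the data, decode it with Base64 and save it as a TTF file.
import requests
import re
import base64

url = "https://bj.zu.anjuke.com/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
}
response = requests.get(url, headers=headers)
html = response.content.decode("utf-8")
# Capture the payload between "base64," and the closing "')" (the ")" must be escaped)
data1 = re.findall(r"base64,(.*?)'\)", html, re.S)[0]
print(data1)
# Base64-decode the payload into the raw font bytes
data2 = base64.b64decode(data1)
print(data2)
# Save the bytes as a TTF file
with open("./anjuke.ttf", "wb") as file:
    file.write(data2)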
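Open the saved file in FontCreator to inspect it.
Run the script again, open the new file, and compare.
The comparison shows that the code points assigned to the glyphs are different on every request, while the glyphs themselves are the same: 11 entries each time.
Next, use fontTools to read the TTF and get each code point and the glyph it maps to. The code is as follows: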
import requests
import re
import base64
from io import BytesIO
from fontTools.ttLib import TTFont

url = "https://bj.zu.anjuke.com/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
}
response = requests.get(url, headers=headers)
html = response.content.decode("utf-8")
data1 = re.findall(r"base64,(.*?)'\)", html, re.S)[0]
# Base64-decode
data2 = base64.b64decode(data1)
# with open("./anjuke.ttf", "wb") as file:
#     file.write(data2)
# Wrap the bytes in a file-like object
data3 = BytesIO(data2)
# Load the font
font = TTFont(data3)
# Print the glyph names and the code point -> glyph name mapping
print(font.getGlyphOrder())
print(font.getBestCmap())
Running it prints the following (copied here):
['glyph00000', 'glyph00001', 'glyph00002', 'glyph00003', 'glyph00004', 'glyph00005', 'glyph00006', 'glyph00007', 'glyph00008', 'glyph00009', 'glyph00010']
{38006: 'glyph00008', 38287: 'glyph00005', 39228: 'glyph00003', 39499: 'glyph00002', 40506: 'glyph00010', 40611: 'glyph00004', 40804: 'glyph00007', 40850: 'glyph00001', 40868: 'glyph00006', 40869: 'glyph00009'}
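The pattern: 'glyph00001' corresponds to the digit 0, 'glyph00002' to the digit 1, and so on, so the digit is the glyph number minus 1.
The keys such as 38006 are decimal, whereas the TTF file (and the &#x…; entities in the HTML) names the same code points in hexadecimal, in the form uni + hex digits. Converting between base 10 and base 16 is all that is needed.
Take the digit 7 as an example: 38006 in decimal is 9476 in hexadecimal, and it maps to 'glyph00008', that is, 8 - 1 = 7.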
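To make the conversion concrete, here is a small sketch that applies it to the mapping printed above. The dictionary is copied from that one run (the code points change on every request), and `build_font_dict` is a helper name invented for this example.

```python
# cmap copied from the sample output above; a fresh request yields different keys.
cmap = {
    38006: 'glyph00008', 38287: 'glyph00005', 39228: 'glyph00003',
    39499: 'glyph00002', 40506: 'glyph00010', 40611: 'glyph00004',
    40804: 'glyph00007', 40850: 'glyph00001', 40868: 'glyph00006',
    40869: 'glyph00009',
}


def build_font_dict(cmap):
    """Map hex code point (no '0x' prefix) -> digit string (glyph number - 1)."""
    return {
        format(code, "x"): str(int(name[len("glyph"):]) - 1)
        for code, name in cmap.items()
    }


font_dict = build_font_dict(cmap)
print(hex(38006))         # 0x9476
print(font_dict["9476"])  # 7
print(sorted(font_dict.values()))  # ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
```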
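3. Implementation
The overall approach:
- Send a request to https://bj.zu.anjuke.com/ and get the HTML
- Extract the Base64-encoded data and decode it
- Read the font with fontTools
- Find the obfuscated characters in the HTML and look up the original characters through the custom font
This part is largely the same as in the previous post, so here is the code directly. If I find time later, I will add a detailed walkthrough here -。-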
# encoding: utf-8
'''
@author 李华鑫
@create 2020-10-13 10:03
Mycsdn:https://buwenbuhuo.blog.csdn.net/
@contact: 459804692@qq.com
@software: Pycharm
@file: 安居客.py
@Version:1.0
'''
import requests
import re
import base64
import csv
from io import BytesIO
from fontTools.ttLib import TTFont
from lxml import etree


class AnJuKeSpider:
    def __init__(self, url):
        self.url = url
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        }
        self.html = ""
        self.font_dict = {}

    def parse_url(self, url, headers, params=None):
        """Request the url and return the html."""
        response = requests.get(url, headers=headers, params=params)
        return response.content.decode("utf-8")

    def parse_xpath(self, html):
        """Parse the html with lxml and return the element tree for xpath queries."""
        etree_obj = etree.HTML(html)
        return etree_obj

    def get_font_dict(self, html):
        """Build the dict {hex code point: digit}."""
        # Extract the payload with a regular expression (note the escaped ")")
        data1 = re.findall(r"base64,(.*?)'\)", html, re.S)[0]
        # Base64-decode
        data2 = base64.b64decode(data1)
        # Wrap the bytes in a file-like object
        data3 = BytesIO(data2)
        # Load the font
        font = TTFont(data3)
        # Code point -> glyph name, e.g. {38006: 'glyph00008', ...}
        data4 = font.getBestCmap()
        # Key: hex code point without "0x"; value: glyph number minus 1
        return {hex(k)[2:]: str(int(v[5:]) - 1) for k, v in data4.items()}

    def parse_font(self, string):
        """Replace every *xxxx* marker with the digit it stands for."""
        return re.sub(r'(\*[a-z0-9]+?\*)', lambda x: self.font_dict[x.group(1).strip("*")], string)

    def start(self):
        """Main routine."""
        self.html = self.parse_url(url=self.url, headers=self.headers)
        self.font_dict = self.get_font_dict(html=self.html)
        # Turn &#x9476; into *9476* before parsing, so the entities do not
        # end up as unreadable characters in the parsed text
        self.html = re.sub(r"&#x(\w+?);", r"*\1*", self.html)
        # Parse with xpath
        xpath_obj = self.parse_xpath(html=self.html)
        div_list = xpath_obj.xpath('//div[@class="zu-itemmod"]')
        for div in div_list:
            item = {}
            item["title"] = self.parse_font(div.xpath("./div[1]/h3/a/b/text()")[0])
            item["price"] = self.parse_font(div.xpath("./div[2]/p/strong/b/text()")[0])
            self.save(item)

    def save(self, item):
        """Append the item to the csv file."""
        print("Saving {} ...".format(item))
        # newline="" stops the csv module from writing blank lines on Windows
        with open("./安居客.csv", "a", encoding="utf-8", newline="") as file:
            writer = csv.writer(file)
            writer.writerow(item.values())


if __name__ == '__main__':
    url = "https://bj.zu.anjuke.com/"
    AnJuKeSpider(url=url).start()
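The replacement step can also be tried without a network connection. The sketch below is a variation, not the code above: it swaps the two-stage `*xxxx*` marking for a single `re.sub` pass over the raw HTML and leaves unknown entities untouched. The mapping is hard-coded from the sample run shown earlier, and `decode_entities` and the HTML fragment are made up for illustration.

```python
import re

# Matches numeric character references such as &#x9476;
ENTITY_RE = re.compile(r"&#x([0-9a-fA-F]+);")


def decode_entities(html: str, font_dict: dict) -> str:
    """Replace each &#x....; found in font_dict with its digit; keep the rest."""
    def repl(match):
        key = match.group(1).lower()
        # Fall back to the original entity when the code point is not in the font.
        return font_dict.get(key, match.group(0))
    return ENTITY_RE.sub(repl, html)


# Mapping from the sample run earlier in this post; it differs on every request.
font_dict = {
    "9476": "7", "958f": "4", "993c": "2", "9a4b": "1", "9e3a": "9",
    "9ea3": "3", "9f64": "6", "9f92": "0", "9fa4": "5", "9fa5": "8",
}

sample = "<strong><b>&#x9476;&#x9f92;&#x9f92;&#x9f92;</b></strong>"
print(decode_entities(sample, font_dict))  # <strong><b>7000</b></strong>
```

Using `dict.get` with a fallback means an entity that is not part of the custom font (an ordinary character reference, for instance) passes through unchanged instead of raising a `KeyError`.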
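4. Final result
Running the script prints each item as it is saved and appends it to 安居客.csv.
Good times are always short. I would love to keep chatting, but this post ends here. If that was not enough for you, don't worry: see you in the next one!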