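The job I was handed was to hunt for vulnerabilities in a certain state-owned enterprise and the companies it controls, and only high-severity findings counted. What a headache. A quick lookup on Aiqicha (爱企查) showed that it controls more than four thousand companies, so let's just scrape it with Python! First, capture the traffic with bp (Burp Suite) and look at the packets. The response shows that the returned data is Unicode-escaped, so the rough plan is:
- Scrape all of the data
- Decode the Unicode escapes
- Extract the company names we need with a regular expression
Step 1: Scrape the data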
import time
import re
from urllib.parse import unquote

import requests
# Replace the cookie here with the one you captured yourself in bp
header={
'Cookie':'',
'Sec-Ch-Ua':'"Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"',
'Sec-Ch-Ua-Mobile':'?0',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
'Accept':'application/json, text/plain, */*',
'Ymg_ssr':'1668755050538_1668842314729_newx3Oy/bZS3JmKFpuSXF0dFO4LitarO41oOAYMKwim4cWMEziXdEuJoQqC9Po8LnWo5xdt5QyC5SUq0hbg09nAW2K1O1NLJfYLrz3r5165KX/7gQEiIR50kz9mBZl08hCunvgRxyRAAwMXTzf25rjN4BpVmunVEUgBmHGR2d5nht Vzq1QbtBcEwic4HqBWMMGj90dLwILVd0tapplxu4J2lRAgEpW1yLNHPdgmYCA1BS4urb1LmCaUDTC7I8ToSDsexLbmlVuYoOmx 4IlzdZGWV51fl9B7gAktxPdg5qra2UZ9Y57 gJypVJXOtNgJRSL3JjP7XDgYo8bUtTEA6/4vTTYBJLA4CBJ7oXStz8=',
'X-Requested-With':'XMLHttpRequest',
'Zx-Open-Url':'https://aiqicha.baidu.com/company_detail_28684316400936',
'Sec-Ch-Ua-Platform':'Windows',
'Sec-Fetch-Site':'same-origin',
'Sec-Fetch-Mode':'cors',
'Sec-Fetch-Dest':'empty',
'Referer':'https://aiqicha.baidu.com/company_detail_28684316400936',
'Accept-Encoding':'gzip, deflate',
'Accept-Language':'zh,en-US;q=0.9,en;q=0.8,zh-CN;q=0.7',
'Connection':'close',
}
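def input_data(date):
    # Append one raw response body per line to result.txt
    with open("result.txt", mode="a+", encoding="utf-8") as fd:
        fd.write(date + "\n")

# first_step: scrape the names of the companies controlled by the target and write them to
# result.txt in the current directory; afterwards, extract and save the names yourself with
# PyCharm's regex search
def get_date():
    try:
        # 10 entries per page, so 450 pages covers the 4000+ controlled companies
        for i in range(1, 451):
            url = "https://aiqicha.baidu.com/detail/holdsAjax?pid=28684316400936&p={}&size=10&confirm=".format(i)
            respond = requests.get(url=url, headers=header)
            # Pause between requests to avoid hammering the server
            time.sleep(0.5)
            input_data(respond.text)
            print("Page {} scraped and saved".format(i))
    except Exception as err:
        print(err)

if __name__ == "__main__":
    get_date()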
Step 2: Decode
For this I just found an online decoder on the web and pasted the contents of result.txt into it, pass~
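If you would rather not paste the data into a third-party site, the same decoding can be done locally. Below is a minimal sketch, assuming the responses contain standard `\uXXXX` escapes; the helper name `decode_unicode_escapes` and the field name `entName` in the sample are made up for illustration and are not taken from the real response.

```python
import re

# Matches a literal backslash-u followed by exactly four hex digits
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text):
    """Replace every \\uXXXX escape in text with the character it encodes."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Hypothetical sample; the real field names may differ
sample = r'{"entName":"\u767e\u5ea6"}'
print(decode_unicode_escapes(sample))
```

To process the whole dump, read result.txt, pass its contents through the function, and write the result to a new file. This simple substitution does not handle an escaped backslash directly in front of a `u`, nor surrogate pairs, which should rarely matter for company names.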
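Step 3: Regex extraction
Here I again used PyCharm's regular-expression feature: press Ctrl+R to open the replace bar and switch on regex mode. Build whatever pattern fits your needs, pass~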
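The extraction can also be scripted instead of done by hand in PyCharm. The sketch below assumes each company name sits in a JSON string field; the field name `entName` is a guess for illustration, so check the decoded data and adjust the pattern to whatever key the response really uses.

```python
import re

# "entName" is an assumed key; replace it with the real one from the response
NAME_PATTERN = re.compile(r'"entName"\s*:\s*"([^"]*)"')


def extract_names(decoded_text):
    """Return the company names found in decoded_text, deduplicated, in first-seen order."""
    seen = set()
    names = []
    for name in NAME_PATTERN.findall(decoded_text):
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names


# Hypothetical decoded sample with one duplicate entry
sample = '{"list":[{"entName":"示例公司A"},{"entName":"示例公司B"},{"entName":"示例公司A"}]}'
print(extract_names(sample))
```

Deduplicating while keeping the original order makes the output easy to diff against the site, and the resulting list can be written out one name per line for the later asset-discovery work.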