Scraping MODIS NDVI with Python
The main problem to solve when downloading data from NASA with Python is authentication; once authentication is handled, scraping the data itself is straightforward. The site used is the LP DAAC Data Pool: https://lpdaac.usgs.gov/tools/data-pool/
Solving the Authentication Problem
When scraping with the requests library, authentication fails if you simply add the username and password to the request headers. Here we use the code provided on NASA's website to solve the authentication problem; it works by subclassing the requests Session class. Authentication requires an Earthdata account.
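As a quick illustration of why the direct approach fails, here is a minimal sketch (granule_url is only a placeholder, not a real file path): the Data Pool redirects every protected request to urs.earthdata.nasa.gov for login, and requests strips the Authorization header as soon as the redirect changes the hostname, so the credentials never reach the auth host.

import requests

# granule_url is a placeholder for any protected .hdf file URL in the Data Pool.
granule_url = 'https://e4ftl01.cr.usgs.gov/MOLA/MYD13Q1.061/...'

# Plain Basic auth: the request is redirected to urs.earthdata.nasa.gov, and
# requests drops the Authorization header on the cross-host redirect, so the
# response is typically a 401 or the login page instead of the data.
r = requests.get(granule_url, auth=('your_username', 'your_password'))
print(r.status_code)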
Download Code
import requests
import re
import os
from bs4 import BeautifulSoup
from multiprocessing import Pool
def download(session, url, path):
    # Download a single file with the authenticated session and write it to disk.
    response = session.get(url).content
    with open(path, mode='wb') as f:
        f.write(response)
    print(url.split('/')[-1], ': downloaded successfully')
# MODIS sinusoidal-grid tiles covering China
china_tiles = ['h25v03', 'h26v03', 'h23v04', 'h24v04', 'h25v04', 'h26v04', 'h27v04',
               'h23v05', 'h24v05', 'h25v05', 'h26v05', 'h28v05', 'h25v06', 'h26v06',
               'h27v06', 'h28v06', 'h29v06', 'h28v07']
url = 'https://e4ftl01.cr.usgs.gov/MOLA/MYD13Q1.061/'
out_path = '/content/drive/MyDrive/my_code/爬虫/'
class SessionWithHeaderRedirection(requests.Session):
    AUTH_HOST = 'urs.earthdata.nasa.gov'

    def __init__(self, username, password):
        super().__init__()
        self.auth = (username, password)

    # Overrides from the library to keep headers when redirected to or from
    # the NASA auth host.
    def rebuild_auth(self, prepared_request, response):
        headers = prepared_request.headers
        url = prepared_request.url
        if 'Authorization' in headers:
            original_parsed = requests.utils.urlparse(response.request.url)
            redirect_parsed = requests.utils.urlparse(url)
            if (original_parsed.hostname != redirect_parsed.hostname) and \
                    redirect_parsed.hostname != self.AUTH_HOST and \
                    original_parsed.hostname != self.AUTH_HOST:
                del headers['Authorization']
        return
username = 'replace with your Earthdata username'
password = 'replace with your Earthdata password'
session = SessionWithHeaderRedirection(username, password)
response = session.get(url)
root_content = BeautifulSoup(response.content, 'html.parser').find_all('a')
time_url = []
for i in root_content:
    # Date directories look like '2002.07.04/'; keep only links of that form.
    temp_url = re.findall(r'........../', i['href'])
    if len(temp_url) > 0:
        time_url.append(os.path.join(url, temp_url[0]))
save_path = '/content/drive/MyDrive/dataset/MYD13Q1'
pool = Pool(5)
for j in time_url:
    time_path = os.path.join(save_path, j.split('/')[-2])
    print(j.split('/')[-2])
    if not os.path.exists(time_path):
        os.makedirs(time_path)
    time_content = BeautifulSoup(session.get(j).content, 'html.parser').find_all('a')
    for k in time_content:
        # Keep only .hdf granules whose tile id is in the China tile list.
        tile_name = re.findall(r'MYD13Q1.A[0-9]*.(?:h25v03|h26v03|h23v04|h24v04|h25v04|h26v04|h27v04|h23v05|h24v05|h25v05|h26v05|h28v05|h25v06|h26v06|h27v06|h28v06|h29v06|h28v07).061.[0-9]*.hdf\b', k['href'])
        if len(tile_name) > 0:
            tile_url = os.path.join(j, tile_name[0])
            tile_path = os.path.join(time_path, tile_name[0])
            # download(session, tile_url, tile_path)
            pool.apply_async(func=download, args=(session, tile_url, tile_path,))
pool.close()
pool.join()
The key part of this code is the subclassed Session, which solves the account authentication problem. The MODIS tiles covering China are selected with a regular expression, and Python multiprocessing is used to make the crawler faster. The product downloaded here is MYD13Q1, the 250 m resolution vegetation index product. The data volume for the whole of China is fairly large; I am still downloading it and will share the data once the download is complete.
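One small refinement worth considering (my own sketch, not part of the original script): since the script already defines china_tiles, the granule-name regex can be generated from that list instead of being typed out a second time, so the list and the pattern can never drift apart.

import re

china_tiles = ['h25v03', 'h26v03', 'h23v04', 'h24v04', 'h25v04', 'h26v04', 'h27v04',
               'h23v05', 'h24v05', 'h25v05', 'h26v05', 'h28v05', 'h25v06', 'h26v06',
               'h27v06', 'h28v06', 'h29v06', 'h28v07']

# Build the granule-name pattern from the tile list; dots are escaped and the
# tile alternatives are joined into a single non-capturing group.
tile_pattern = re.compile(
    r'MYD13Q1\.A[0-9]+\.(?:' + '|'.join(china_tiles) + r')\.061\.[0-9]+\.hdf\b'
)

# Made-up example filename, only to show how the pattern is used.
name = 'MYD13Q1.A2022185.h26v05.061.2022202235959.hdf'
print(bool(tile_pattern.search(name)))  # True: h26v05 is one of the China tiles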