Scraping MODIS NDVI with Python
The main problem to solve when downloading data from NASA with Python is authentication; once authentication is handled, scraping the data itself is straightforward. The site used is the LP DAAC Data Pool: https://lpdaac.usgs.gov/tools/data-pool/
Solving the Authentication Problem
When scraping with the requests library, authentication fails if you simply add the username and password to the request headers. Here we use the code provided on NASA's website to solve the authentication problem; it works by subclassing the requests Session class. Authentication requires an Earthdata account.
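As a quick illustration of why the direct approach fails, here is a minimal sketch (granule_url is only a placeholder, not a real file path): the Data Pool redirects every protected request to urs.earthdata.nasa.gov for login, and requests strips the Authorization header as soon as the redirect changes the hostname, so the credentials never reach the auth host.

import requests

# granule_url is a placeholder for any protected .hdf file URL in the Data Pool.
granule_url = 'https://e4ftl01.cr.usgs.gov/MOLA/MYD13Q1.061/...'

# Plain Basic auth: the request is redirected to urs.earthdata.nasa.gov, and
# requests drops the Authorization header on the cross-host redirect, so the
# response is typically a 401 or the login page instead of the data.
r = requests.get(granule_url, auth=('your_username', 'your_password'))
print(r.status_code)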
Download Code
import requests
import re
import os
from bs4 import BeautifulSoup
from multiprocessing import Pool
def download(session, url, path):
    # Download a single file with the authenticated session and write it to disk.
    response = session.get(url).content
    with open(path, mode='wb') as f:
        f.write(response)
    print(url.split('/')[-1], ': downloaded successfully')
# MODIS sinusoidal-grid tiles covering China
china_tiles = ['h25v03', 'h26v03', 'h23v04', 'h24v04', 'h25v04', 'h26v04', 'h27v04',
               'h23v05', 'h24v05', 'h25v05', 'h26v05', 'h28v05', 'h25v06', 'h26v06',
               'h27v06', 'h28v06', 'h29v06', 'h28v07']
url = 'https://e4ftl01.cr.usgs.gov/MOLA/MYD13Q1.061/'
out_path = '/content/drive/MyDrive/my_code/爬虫/'
class SessionWithHeaderRedirection(requests.Session):
    AUTH_HOST = 'urs.earthdata.nasa.gov'

    def __init__(self, username, password):
        super().__init__()
        self.auth = (username, password)

    # Overrides from the library to keep headers when redirected to or from
    # the NASA auth host.
    def rebuild_auth(self, prepared_request, response):
        headers = prepared_request.headers
        url = prepared_request.url
        if 'Authorization' in headers:
            original_parsed = requests.utils.urlparse(response.request.url)
            redirect_parsed = requests.utils.urlparse(url)
            if (original_parsed.hostname != redirect_parsed.hostname) and \
                    redirect_parsed.hostname != self.AUTH_HOST and \
                    original_parsed.hostname != self.AUTH_HOST:
                del headers['Authorization']
        return
username = 'replace with your Earthdata username'
password = 'replace with your Earthdata password'
session = SessionWithHeaderRedirection(username, password)
response = session.get(url)
root_content = BeautifulSoup(response.content, 'html.parser').find_all('a')
time_url = []
for i in root_content:
    # Date directories look like '2002.07.04/'; keep only links of that form.
    temp_url = re.findall(r'........../', i['href'])
    if len(temp_url) > 0:
        time_url.append(os.path.join(url, temp_url[0]))
save_path = '/content/drive/MyDrive/dataset/MYD13Q1'
pool = Pool(5)
for j in time_url:
    time_path = os.path.join(save_path, j.split('/')[-2])
    print(j.split('/')[-2])
    if not os.path.exists(time_path):
        os.makedirs(time_path)
    time_content = BeautifulSoup(session.get(j).content, 'html.parser').find_all('a')
    for k in time_content:
        # Keep only .hdf granules whose tile id is in the China tile list.
        tile_name = re.findall(r'MYD13Q1.A[0-9]*.(?:h25v03|h26v03|h23v04|h24v04|h25v04|h26v04|h27v04|h23v05|h24v05|h25v05|h26v05|h28v05|h25v06|h26v06|h27v06|h28v06|h29v06|h28v07).061.[0-9]*.hdf\b', k['href'])
        if len(tile_name) > 0:
            tile_url = os.path.join(j, tile_name[0])
            tile_path = os.path.join(time_path, tile_name[0])
            # download(session, tile_url, tile_path)
            pool.apply_async(func=download, args=(session, tile_url, tile_path,))
pool.close()
pool.join()
The key part of this code is the subclassed Session, which solves the account authentication problem. The MODIS tiles covering China are selected with a regular expression, and Python multiprocessing is used to make the crawler faster. The product downloaded here is MYD13Q1, the 250 m resolution vegetation index product. The data volume for the whole of China is fairly large; I am still downloading it and will share the data once the download is complete.
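One small refinement worth considering (my own sketch, not part of the original script): since the script already defines china_tiles, the granule-name regex can be generated from that list instead of being typed out a second time, so the list and the pattern can never drift apart.

import re

china_tiles = ['h25v03', 'h26v03', 'h23v04', 'h24v04', 'h25v04', 'h26v04', 'h27v04',
               'h23v05', 'h24v05', 'h25v05', 'h26v05', 'h28v05', 'h25v06', 'h26v06',
               'h27v06', 'h28v06', 'h29v06', 'h28v07']

# Build the granule-name pattern from the tile list; dots are escaped and the
# tile alternatives are joined into a single non-capturing group.
tile_pattern = re.compile(
    r'MYD13Q1\.A[0-9]+\.(?:' + '|'.join(china_tiles) + r')\.061\.[0-9]+\.hdf\b'
)

# Made-up example filename, only to show how the pattern is used.
name = 'MYD13Q1.A2022185.h26v05.061.2022202235959.hdf'
print(bool(tile_pattern.search(name)))  # True: h26v05 is one of the China tiles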