Continuing right where Golang source code analysis: the colly crawler (part I) left off, let's look at colly's most central file, colly.go.
H. colly.go first defines the function type for the hook callbacks used in crawler development:
```go
type CollectorOption func(*Collector)
```
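To make this concrete: an option is just a closure that mutates the Collector. Colly's built-in UserAgent option is implemented in essentially this shape (a minimal sketch):

```go
// UserAgent returns a CollectorOption that sets the Collector's
// User-Agent string; colly's built-in options all follow this pattern.
func UserAgent(ua string) CollectorOption {
	return func(c *Collector) {
		c.UserAgent = ua
	}
}
```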
The Collector struct these options receive is defined below; it holds all of the crawler's parameters. Each field's comment in the source is already quite detailed, so they are kept verbatim here:
```go
type Collector struct {
	// UserAgent is the User-Agent string used by HTTP requests
	UserAgent string
	// MaxDepth limits the recursion depth of visited URLs.
	// Set it to 0 for infinite recursion (default).
	MaxDepth int
	// AllowedDomains is a domain whitelist.
	// Leave it blank to allow any domains to be visited
	AllowedDomains []string
	// DisallowedDomains is a domain blacklist.
	DisallowedDomains []string
	// DisallowedURLFilters is a list of regular expressions which restricts
	// visiting URLs. If any of the rules matches to a URL the
	// request will be stopped. DisallowedURLFilters will
	// be evaluated before URLFilters
	// Leave it blank to allow any URLs to be visited
	DisallowedURLFilters []*regexp.Regexp
	// URLFilters is a list of regular expressions which restricts
	// visiting URLs. If any of the rules matches to a URL the
	// request won't be stopped. DisallowedURLFilters will
	// be evaluated before URLFilters
	// Leave it blank to allow any URLs to be visited
	URLFilters []*regexp.Regexp
	// AllowURLRevisit allows multiple downloads of the same URL
	AllowURLRevisit bool
	// MaxBodySize is the limit of the retrieved response body in bytes.
	// 0 means unlimited.
	// The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes).
	MaxBodySize int
	// CacheDir specifies a location where GET requests are cached as files.
	// When it's not defined, caching is disabled.
	CacheDir string
	// IgnoreRobotsTxt allows the Collector to ignore any restrictions set by
	// the target host's robots.txt file. See http://www.robotstxt.org/ for more
	// information.
	IgnoreRobotsTxt bool
	// Async turns on asynchronous network communication. Use Collector.Wait() to
	// be sure all requests have been finished.
	Async bool
	// ParseHTTPErrorResponse allows parsing HTTP responses with non 2xx status codes.
	// By default, Colly parses only successful HTTP responses. Set ParseHTTPErrorResponse
	// to true to enable it.
	ParseHTTPErrorResponse bool
	// ID is the unique identifier of a collector
	ID uint32
	// DetectCharset can enable character encoding detection for non-utf8 response bodies
	// without explicit charset declaration. This feature uses https://github.com/saintfish/chardet
	DetectCharset bool
	// RedirectHandler allows control on how a redirect will be managed
	// use c.SetRedirectHandler to set this value
	redirectHandler func(req *http.Request, via []*http.Request) error
	// CheckHead performs a HEAD request before every GET to pre-validate the response
	CheckHead bool
	// TraceHTTP enables capturing and reporting request performance for crawler tuning.
	// When set to true, the Response.Trace will be filled in with an HTTPTrace object.
	TraceHTTP bool
	// Context is the context that will be used for HTTP requests. You can set this
	// to support clean cancellation of scraping.
	Context context.Context

	store                    storage.Storage
	debugger                 debug.Debugger
	robotsMap                map[string]*robotstxt.RobotsData
	htmlCallbacks            []*htmlCallbackContainer
	xmlCallbacks             []*xmlCallbackContainer
	requestCallbacks         []RequestCallback
	responseCallbacks        []ResponseCallback
	responseHeadersCallbacks []ResponseHeadersCallback
	errorCallbacks           []ErrorCallback
	scrapedCallbacks         []ScrapedCallback
	requestCount             uint32
	responseCount            uint32
	backend                  *httpBackend
	wg                       *sync.WaitGroup
	lock                     *sync.RWMutex
}
```
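In practice these fields are usually set through option functions passed to NewCollector (covered below) rather than assigned directly. A hedged sketch of common usage, assuming the standard import path:

```go
package main

import "github.com/gocolly/colly"

func main() {
	// Each argument is a CollectorOption that mutates the new Collector.
	c := colly.NewCollector(
		colly.MaxDepth(2),                   // stop recursing after two levels
		colly.AllowedDomains("example.com"), // domain whitelist
		colly.Async(true),                   // the Async field from the struct above
	)
	_ = c // hook registrations and Visit calls would follow here
}
```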
Next come the functions defined on Collector:
```go
func (c *Collector) Init()
func (c *Collector) Appengine(ctx context.Context)
func (c *Collector) Visit(URL string) error
func (c *Collector) HasVisited(URL string) (bool, error)
func (c *Collector) HasPosted(URL string, requestData map[string]string) (bool, error)
func (c *Collector) Head(URL string) error
func (c *Collector) Request(method, URL string, requestData io.Reader, ctx *Context, hdr http.Header) error
func (c *Collector) UnmarshalRequest(r []byte) (*Request, error)
func (c *Collector) scrape(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, checkRevisit bool) error
func setRequestBody(req *http.Request, body io.Reader)
func (c *Collector) fetch(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, req *http.Request) error
func (c *Collector) requestCheck(u string, parsedURL *url.URL, method string, requestData io.Reader, depth int, checkRevisit bool) error
func (c *Collector) checkRobots(u *url.URL) error

// OnRequest appends a callback; note it lazily allocates the slice
// and appends rather than replacing earlier registrations.
func (c *Collector) OnRequest(f RequestCallback) {
	c.lock.Lock()
	if c.requestCallbacks == nil {
		c.requestCallbacks = make([]RequestCallback, 0, 4)
	}
	c.requestCallbacks = append(c.requestCallbacks, f)
	c.lock.Unlock()
}

func (c *Collector) OnResponseHeaders(f ResponseHeadersCallback) {
	c.lock.Lock()
	c.responseHeadersCallbacks = append(c.responseHeadersCallbacks, f)
	c.lock.Unlock()
}

// handleOnRequest fires every registered OnRequest callback in order.
func (c *Collector) handleOnRequest(r *Request) {
	if c.debugger != nil {
		c.debugger.Event(createEvent("request", r.ID, c.ID, map[string]string{
			"url": r.URL.String(),
		}))
	}
	for _, f := range c.requestCallbacks {
		f(r)
	}
}

func (c *Collector) handleOnHTML(resp *Response) error
```
Of these, the ones we use most often are the following:
```go
func (c *Collector) Visit(URL string) error {
	return c.scrape(URL, "GET", 1, nil, nil, nil, true)
}
```
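Combined with an OnHTML hook, Visit is all it takes to get a link-following crawler; a sketch reusing the collector c from the earlier snippet (example.com is a placeholder):

```go
// Queue every link found on each fetched page.
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	e.Request.Visit(e.Attr("href"))
})

// Seed the crawl; with Async enabled, Wait blocks until all requests finish.
c.Visit("https://example.com/")
c.Wait()
```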
Visit is the entry point that starts a crawl; it delegates to the scrape function:
```go
func (c *Collector) scrape(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, checkRevisit bool) error {
	// ... request validation and setup elided ...
	c.wg.Add(1)
	if c.Async {
		// async mode: fire the fetch in a goroutine and return immediately
		go c.fetch(u, method, depth, requestData, ctx, hdr, req)
		return nil
	}
	// sync mode: fetch inline and propagate its error
	return c.fetch(u, method, depth, requestData, ctx, hdr, req)
}
```
scrape in turn calls fetch:
```go
func (c *Collector) fetch(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, req *http.Request) error {
	// ... (excerpt: the hook invocations, in call order) ...
	c.handleOnRequest(request)
	c.handleOnResponseHeaders(&Response{Ctx: ctx, Request: request, StatusCode: statusCode, Headers: &headers})
	if err := c.handleOnError(response, err, request, ctx); err != nil {
		return err
	}
	c.handleOnResponse(response)
	err = c.handleOnHTML(response)
	c.handleOnScraped(response)
	// ...
}
```
fetch is where the callbacks we registered get invoked; these are the hook points.
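Read top to bottom, fetch fires the hooks for one request in a fixed order: OnRequest → OnResponseHeaders → OnResponse → OnHTML → OnScraped, with OnError on failures. A sketch of registering a few of them (fmt import assumed):

```go
c.OnRequest(func(r *colly.Request) {
	fmt.Println("visiting", r.URL)
})
c.OnResponse(func(r *colly.Response) {
	fmt.Println("got", r.StatusCode, "from", r.Request.URL)
})
c.OnError(func(r *colly.Response, err error) {
	fmt.Println("request failed:", err)
})
```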
Next, colly defines a series of type aliases for these hooks:
```go
// RequestCallback is a type alias for OnRequest callback functions
type RequestCallback func(*Request)

// ResponseHeadersCallback is a type alias for OnResponseHeaders callback functions
type ResponseHeadersCallback func(*Response)

// ResponseCallback is a type alias for OnResponse callback functions
type ResponseCallback func(*Response)

// HTMLCallback is a type alias for OnHTML callback functions
type HTMLCallback func(*HTMLElement)

// XMLCallback is a type alias for OnXML callback functions
type XMLCallback func(*XMLElement)

// ErrorCallback is a type alias for OnError callback functions
type ErrorCallback func(*Response, error)

// ScrapedCallback is a type alias for OnScraped callback functions
type ScrapedCallback func(*Response)

// ProxyFunc is a type alias for proxy setter functions.
type ProxyFunc func(*http.Request) (*url.URL, error)
```
envMap stores the supported environment variables, i.e., a series of settings that can be applied before the crawler starts. Two representative entries:
```go
var envMap = map[string]func(*Collector, string){
	"ALLOWED_DOMAINS": func(c *Collector, val string) {
		c.AllowedDomains = strings.Split(val, ",")
	},
	"CACHE_DIR": func(c *Collector, val string) {
		c.CacheDir = val
	},
	// ... one entry per supported setting ...
}
```
During initialization, after the option funcs have been applied, the collector parses these environment variables:
```go
func NewCollector(options ...CollectorOption) *Collector {
	c := &Collector{}
	c.Init()
	for _, f := range options {
		f(c)
	}
	c.parseSettingsFromEnv()
	return c
}
```
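parseSettingsFromEnv only considers variables carrying the COLLY_ prefix, which it strips before the envMap lookup. A hedged usage sketch (os import assumed):

```go
// COLLY_ALLOWED_DOMAINS resolves to the ALLOWED_DOMAINS entry of envMap.
os.Setenv("COLLY_ALLOWED_DOMAINS", "example.com,news.example.com")
os.Setenv("COLLY_CACHE_DIR", "/tmp/colly-cache")

c := colly.NewCollector() // env settings win, applied after the option funcs
```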
I. context.go defines the Context:
```go
type Context struct {
	contextMap map[string]interface{}
	lock       *sync.RWMutex
}
```
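Context is a lock-protected key/value map shared between a Request and its Response, which is how data crosses the two hooks. A sketch (time and fmt imports assumed):

```go
c.OnRequest(func(r *colly.Request) {
	// Stash a value on the request's context...
	r.Ctx.Put("startedAt", time.Now().String())
})
c.OnResponse(func(r *colly.Response) {
	// ...and read it back when the matching response arrives.
	fmt.Println("started at:", r.Ctx.Get("startedAt"))
})
```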
J. htmlelement.go defines the methods commonly used for parsing HTML:
```go
type HTMLElement struct {
	// Name is the name of the tag
	Name       string
	Text       string
	attributes []html.Attribute
	// Request is the request object of the element's HTML document
	Request *Request
	// Response is the Response object of the element's HTML document
	Response *Response
	// DOM is the goquery parsed DOM object of the page. DOM is relative
	// to the current HTMLElement
	DOM *goquery.Selection
	// Index stores the position of the current element within all the elements matched by an OnHTML callback
	Index int
}
```
```go
func (h *HTMLElement) Attr(k string) string
func (h *HTMLElement) ChildText(goquerySelector string) string
func (h *HTMLElement) ChildAttr(goquerySelector, attrName string) string
```
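These accessors are normally used inside an OnHTML callback; the selectors below are hypothetical:

```go
c.OnHTML("div.article", func(e *colly.HTMLElement) {
	title := e.ChildText("h1")            // text of the first matching child
	link := e.ChildAttr("a.more", "href") // attribute value of a matching child
	fmt.Println(title, link)
})
```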
K. http_backend.go defines the rate-limiting rules a user can configure:
```go
type httpBackend struct {
	LimitRules []*LimitRule
	Client     *http.Client
	lock       *sync.RWMutex
}

type LimitRule struct {
	// DomainRegexp is a regular expression to match against domains
	DomainRegexp string
	// DomainGlob is a glob pattern to match against domains
	DomainGlob string
	// Delay is the duration to wait before creating a new request to the matching domains
	Delay time.Duration
	// RandomDelay is the extra randomized duration to wait added to Delay before creating a new request
	RandomDelay time.Duration
	// Parallelism is the number of the maximum allowed concurrent requests of the matching domains
	Parallelism    int
	waitChan       chan bool
	compiledRegexp *regexp.Regexp
	compiledGlob   glob.Glob
}
```
```go
func (r *LimitRule) Match(domain string) bool

func (h *httpBackend) Do(request *http.Request, bodySize int, checkHeadersFunc checkHeadersFunc) (*Response, error) {
	// ... after applying the matching LimitRule, it delegates to the client:
	res, err := h.Client.Do(request)
	// ...
}
```
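LimitRules are installed via Collector.Limit and enforced inside httpBackend.Do; a hedged sketch that throttles every domain (time import assumed):

```go
c.Limit(&colly.LimitRule{
	DomainGlob:  "*",                    // match every domain
	Parallelism: 2,                      // at most two concurrent requests
	Delay:       1 * time.Second,        // fixed wait between requests
	RandomDelay: 500 * time.Millisecond, // plus a random extra wait
})
```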
L. http_trace.go defines the trace-related data:
```go
type HTTPTrace struct {
	start, connect    time.Time
	ConnectDuration   time.Duration
	FirstByteDuration time.Duration
}
```
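Tracing is opt-in through the Collector.TraceHTTP flag shown earlier; once set, the collector fills Response.Trace:

```go
c := colly.NewCollector()
c.TraceHTTP = true // ask the collector to attach an HTTPTrace to each Response

c.OnResponse(func(r *colly.Response) {
	if r.Trace != nil {
		fmt.Println("connect:", r.Trace.ConnectDuration,
			"first byte:", r.Trace.FirstByteDuration)
	}
})
```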
M. request.go defines the request-related data:
```go
type Request struct {
	// URL is the parsed URL of the HTTP request
	URL *url.URL
	// Headers contains the Request's HTTP headers
	Headers *http.Header
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Depth is the number of the parents of the request
	Depth int
	// Method is the HTTP method of the request
	Method string
	// Body is the request body which is used on POST/PUT requests
	Body io.Reader
	// ResponseCharacterencoding is the character encoding of the response body.
	// Leave it blank to allow automatic character encoding of the response body.
	// It is empty by default and it can be set in OnRequest callback.
	ResponseCharacterEncoding string
	// ID is the Unique identifier of the request
	ID        uint32
	collector *Collector
	abort     bool
	baseURL   *url.URL
	// ProxyURL is the proxy address that handles the request
	ProxyURL string
}
```
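The public fields are typically tweaked from an OnRequest hook, and Abort() drops a request before it is sent; a contrived sketch:

```go
c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("Accept-Language", "en-US") // adjust headers before sending
	if r.Depth > 3 {
		r.Abort() // skip requests that are nested too deep
	}
})
```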
N. response.go defines the corresponding response:
```go
type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
	// Trace contains the HTTPTrace for the request. Will only be set by the
	// collector if Collector.TraceHTTP is set to true.
	Trace *HTTPTrace
}
```
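The raw bytes live in Body, and the Response type also offers a Save helper; a sketch with an illustrative file name:

```go
c.OnResponse(func(r *colly.Response) {
	// Save writes r.Body to disk; "page.html" is just a placeholder name.
	if err := r.Save("page.html"); err != nil {
		fmt.Println("save failed:", err)
	}
})
```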
O. unmarshal.go defines the deserialization of HTML into structs:
```go
func UnmarshalHTML(v interface{}, s *goquery.Selection, structMap map[string]string) error
```
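UnmarshalHTML maps matched elements onto struct fields via `selector` tags and is usually reached through HTMLElement.Unmarshal; the struct and selectors here are hypothetical:

```go
type product struct {
	Name  string `selector:"h1.title"`
	Price string `selector:"span.price"`
}

c.OnHTML("div.product", func(e *colly.HTMLElement) {
	p := product{}
	if err := e.Unmarshal(&p); err == nil {
		fmt.Printf("%+v\n", p)
	}
})
```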
P. xmlelement.go defines XMLElement, the XML counterpart of HTMLElement:
```go
type XMLElement struct {
	// Name is the name of the tag
	Name       string
	Text       string
	attributes interface{}
	// Request is the request object of the element's HTML document
	Request *Request
	// Response is the Response object of the element's HTML document
	Response *Response
	// DOM is the DOM object of the page. DOM is relative
	// to the current XMLElement and is either a html.Node or xmlquery.Node
	// based on how the XMLElement was created.
	DOM    interface{}
	isHTML bool
}
```
```go
func (h *XMLElement) ChildText(xpathQuery string) string
```
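XMLElement is what OnXML callbacks receive, and its queries are XPath expressions rather than goquery selectors; a sketch against a hypothetical RSS feed:

```go
c.OnXML("//channel/item", func(e *colly.XMLElement) {
	fmt.Println(e.ChildText("title")) // XPath query relative to this element
})
```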
To sum up: a crawler boils down to a few basic elements: a task queue of URLs to fetch, parsing of the fetched results, and local storage. You can think of a crawler as a more elaborate HTTP client, but colly, through option funcs plus event hooks, abstracts and simplifies the crawler logic, so users can conveniently define optional parameters, hook in their own processing, and implement a crawler quickly.