A Brief Introduction to Scrapy, Part 2

0x00 Preface

Continuing from the previous post, this one introduces more of Scrapy's features and usage patterns.

0x01 The Request Object in Scrapy

The Request object comes up constantly when writing Scrapy code: to hand a request to the scheduler for crawling, you have to construct a Request object. It carries a variety of attributes you can set, such as the request's headers and cookies. Its basic parameters are:

  • url: the URL to request

  • callback: the function that processes the response when it comes back, also known as the callback function

  • headers: the request's header data

  • cookies: the cookies to set for the request. Two small examples follow; neither sets a callback, so Scrapy falls back to parse() as the default callback.

from scrapy import Request

# Passing several cookie fields: a list of dicts
request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])

# Passing plain key-value cookies: a dict
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})
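
The headers parameter takes a dict in the same spirit. A minimal sketch of overriding the request headers; the header values here are purely illustrative:

from scrapy import Request

# Illustrative values: override the User-Agent and add a Referer
request_with_headers = Request(url="http://www.example.com",
                               headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
                                        'Referer': 'http://www.example.com/index'})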
  • meta: the most magical parameter. It is a dict that can be used to pass values between pages, but it also recognizes a pile of important special keys; setting them appropriately lets you fine-tune your request. First, let's look at passing values.

I reworked the earlier Maoyan spider a little: the item object holding the scraped data is stashed in the meta dict as meta = {"key": item}, the new Request fetches the next URL with a new callback, parse2, and parse2 takes the data back out of meta and hands it to the pipeline. That is how I pass data between pages.

def parse(self, response):
    nodelist = response.xpath("//div[@class='board-item-main']")
    for node in nodelist:
        item = MaoyanItem()
        item['name'] = node.xpath(".//a/text()").extract()[0]
        item['actors'] = node.xpath(".//p[@class='star']/text()").extract()[0].split()
        integer = node.xpath(".//i[@class='integer']/text()").extract()[0]
        fraction = node.xpath(".//i[@class='fraction']/text()").extract()[0]
        item['score'] = integer + fraction
        if self.offset < 10:
            self.offset += 10
            url = self.base_url + str(self.offset)
            # url = response.urljoin(str(self.offset))
            # yield response.follow(str(self.offset), callback=self.parse2, meta={"key": item}, dont_filter=True)
            # Stash the item in meta so parse2 can retrieve it
            yield scrapy.Request(url, callback=self.parse2, meta={"key": item}, dont_filter=True)

def parse2(self, response):
    # Pull the item back out of meta and hand it to the pipeline
    item = response.meta["key"]
    yield item

The official documentation explains meta as follows:

A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.

See Request.meta special keys for a list of special meta keys recognized by Scrapy.

This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.

The other special meta keys and their details can be found at the official link below:

https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-meta
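
As a taste, here is a minimal sketch using two of the special keys documented there, proxy and download_timeout; the proxy address is a placeholder:

# 'proxy' routes the request through an HTTP proxy and 'download_timeout'
# caps the download time in seconds; both are read by Scrapy itself, while
# "key" remains our own page-to-page payload.
yield scrapy.Request(url,
                     callback=self.parse2,
                     meta={"key": item,
                           "proxy": "http://127.0.0.1:8080",  # placeholder address
                           "download_timeout": 30})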

0x02 Reading the Request Object's Source Code


class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):

        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority

        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None
        self.flags = [] if flags is None else list(flags)

Here you can see how the parameters above map onto the implementation, for example how meta is initialized and which formats cookies and headers accept. It is worth reading carefully.
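
To make that concrete, a small sketch against the version excerpted above; newer Scrapy releases may have replaced these assertions:

from scrapy import Request

r = Request(url="http://www.example.com", method="get", priority=5)
print(r.method)    # 'GET': the method string is upper-cased in __init__
print(r.priority)  # 5: anything but an int trips the first assert

# An errback without a callback trips the second assert
try:
    Request(url="http://www.example.com", errback=lambda failure: None)
except AssertionError as e:
    print(e)       # Cannot use errback without a callback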

0x03 The Response Object in Scrapy

  • Basic attributes (a quick sketch of accessing them follows the list)

url: the URL that was requested
body: the HTML that came back from the request
meta: used to pass data between "pages"
headers: the response's header data
cookies: the cookies set for the page
request: the Request object that produced this response
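
A minimal sketch touching these attributes inside a spider callback:

def parse(self, response):
    self.logger.info(response.url)          # the URL this response came from
    self.logger.info(response.headers)      # the response headers
    self.logger.info(response.body[:100])   # the raw body bytes (the HTML)
    self.logger.info(response.meta)         # shortcut to response.request.meta
    self.logger.info(response.request.url)  # the Request that generated this response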

These attributes correspond almost one-to-one with the Request object. The main additions are two new methods:

urljoin(): turns a relative path on the page into an absolute URL

follow(): also completes a relative path into an absolute URL, and can be seen as an upgraded urljoin(), because it returns a Request object directly

if self.offset < 10:
    self.offset += 10

    # Build an absolute URL with urljoin(), then pass it to a Request
    url = response.urljoin(str(self.offset))
    yield scrapy.Request(url, callback=self.parse2, meta={"key": item}, dont_filter=True)

    # Or build the Request directly with response's follow() method
    yield response.follow(str(self.offset), callback=self.parse2, meta={"key": item}, dont_filter=True)
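
Since Scrapy 1.4, follow() can also take a selector pointing at an <a> element and extract the href for you; a sketch, with an illustrative CSS selector:

# Follow every link in a hypothetical pagination bar
for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse2)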

Reference:

https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response