当前位置：首页 > news >正文

Scrapy框架（高效爬虫）

news 2025/7/6 16:46:30

文章目录

一、环境配置
二、创建项目
三、scrapy数据解析
四、基于终端指令的持久化存储
- 1、基于终端指令
- 2、基于管道
- 3、数据同时保存至本地及数据库
- 4、基于spider爬取某网站各页面数据
- 5、爬取本页和详情页信息（请求传参）
- 6、图片数据爬取ImagesPipeline
五、中间件
- 1、拦截请求中间件（UA伪装，代理IP）
- 2、拦截响应中间件(动态加载)
六、CrawlSpider（自动请求全站爬取，全部页面，自动下拉滚轮爬取）
七、分布式爬虫
八、增量式爬虫

Scrapy拥有高性能持久化存储，异步数据下载，高性能数据解析，分布式功能

一、环境配置

环境配置步骤如下（要按步骤来）：

pip install wheel
下载twisted：https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
安装twisted：pip install Twisted-17.1.0-cp36m-win_amd64.whl   (这个文件的路劲)
pip install pywin32
pip install scrapy
测试：在终端输入scrapy指令，没有报错表示安装成功

二、创建项目

步骤：

1、打开pycharm的terminal
2、scrapy startproject first
3、cd first
4、scrapy genspider main www.xxx.com
5、需要有main.py里面的输出，则修改settings.py里面的ROBOTSTXT_OBEY = True改为False
6、scrapy crawl main不需要额外的输出则执行scrapy crawl main --nolog或者在settings.py里面添加LOG_LEVEL='ERROR'，main.py有错误代码会报错（不添加有错误时则不会报错）（常用）

实际操作如下：

打开pycharm的terminal
执行scrapy startproject first，当前目录下会出现个first项目工程，里面有个spiders文件夹，称为爬虫文件夹，在这里放爬虫源文件
在这里插入图片描述

cd first 进入工程目录
执行scrapy genspider main www.xxx.com，在spiders目录中创建一个名为main的爬虫文件，创建的文件自带部分内容
在这里插入图片描述

执行工程：scrapy crawl main (运行main爬虫文件）
在这里插入图片描述
上面parse里面如果有输出运行时没有输出，则需要将settings.py里面的ROBOTSTXT_OBEY = True改为False

改成False后，修改main.py文件内容如下，运行就会有输出

import scrapyclass MainSpider(scrapy.Spider):# 爬虫文件的名称：爬虫源文件的唯一标识name = "main"# 允许的域名：用来限定start_urls列表中哪些url可以进行请求发送,一般不用# allowed_domains = ["www.xxx.com"]# 起始的url列表：该列表中存放的url会被scrapy自动进行请求发送start_urls = ["http://www.baidu.com/","https://www.sogou.com"]# 用于数据解析：response参数表示的就是请求成功后的响应对象def parse(self, response):print(response)pass

在这里插入图片描述使用指令scrapy crawl main --nolog，输出内容就不会输出多余的内容，只输出打印的内容

在这里插入图片描述

如果里面的代码输入错误，运行加上–nolog后不会有输出，看不出报错
有个好方法，运行不用加–nolog，在settings.py里面添加LOG_LEVEL='ERROR'，如果输入错误，就会报ERROR，如果没错，输出内容和上面运行带上–nolog一样，这里就不用加–nolog
在这里插入图片描述

三、scrapy数据解析

首先创建项目

scrapy startproject first
cd first
scrapy genspider main www.xxx.com

在这里插入图片描述

修改settings.py文件的ROBOTSTXT_OBEY = True改为False，才会有输出，添加LOG_LEVEL='ERROR'
在将注释的USER_AGENT取消注释，在页面中复制该内容到这个变量中
在这里插入图片描述

在这里插入图片描述
修改main.py文件

import scrapyclass MainSpider(scrapy.Spider):name = "main"# allowed_domains = ["www.xxx.com"]start_urls = ["http://www.baidu.com/"]def parse(self, response):# 解析数据# xpath返回的是列表，列表元素一定是Selector类型的对象div_list = response.xpath('//div')for div in div_list:# extract可以将selector对象中data参数存储的字符串提取出来(可以是列表)print(div.extract())

执行scrapy crawl main 就会将解析的数据输出
关于Xpath使用请看：https://blog.csdn.net/weixin_46287157/article/details/116432393

四、基于终端指令的持久化存储

1、基于终端指令

只可以将parse方法的返回值存储到本地的文本文件中
基于上面解析的数据，将要保存的数据保存在datas中，然后return返回
再执行指令scrapy crawl main -o ./a.csv，就可以将返回的datas数据保存在a.csv中（要加-o参数保存文件，也可以./a,txt）
注意：持久化存储的文本类型只可以为json, jsonlines, jsonl, jl, csv, xml, marshal, pickle

import scrapyclass MainSpider(scrapy.Spider):name = "main"allowed_domains = ["www.baidu.com"]start_urls = ["http://www.baidu.com/"]def parse(self, response):datas = []# 解析数据# xpath返回的是列表，列表元素一定是Selector类型的对象div_list = response.xpath('//div[@id="..."]/div')for div in div_list:# extract可以将selector对象中data参数存储的字符串提取出来(可以是列表)datas.append(div.xpath('./div').extract())return datas

2、基于管道

首先需要创建项目：

1、打开pycharm的terminal
2、scrapy startproject first
3、cd first
4、scrapy genspider main www.xxx.com
5、修改settings.py里面的ROBOTSTXT_OBEY = True改为False并添加LOG_LEVEL='ERROR'
6、scrapy crawl main

接着需要在主函数（main.py文件）中进行数据解析，main.py内容如下

import scrapy
from first.items import FirstItemclass MainSpider(scrapy.Spider):name = "main"# allowed_domains = ["www.xxx.com"]start_urls = ["https://www.gushiwen.cn/"]def parse(self, response):# 解析数据# xpath返回的是列表，列表元素一定是Selector类型的对象div_list = response.xpath('/html/body/div[2]/div[1]/div')for div in div_list:# extract可以将selector对象中data参数存储的字符串提取出来(可以是列表)# 解析诗歌标题和内容# 加个if判断，如果解析到的不为空，就进行存储if len(div.xpath('./div[1]/p[1]/a/b/text()')) != 0 and len(div.xpath('./div[1]/div[2]/text()')) !=0:# 解析数据title = div.xpath('./div[1]/p[1]/a/b/text()')[0].extract()content = div.xpath('./div[1]/div[2]/text()')[0].extract()

然后在item类中定义相关的属性，将解析的数据存储到item类型的对象（items.py里面有个FirstItem类，可以实例化对象，即item类的实例对象）。修改items.py文件，定义相关属性，比如诗歌、内容，将解析到的数据封装到item对象中：

class FirstItem(scrapy.Item):# define the fields for your item here like:# name = scrapy.Field()title = scrapy.Field()content = scrapy.Field()pass

用item实例化对象后，这个对象就能取得title和content。将item类型的对象提交给管道进行持久化存储操作。pipelines.py中定义了FirstPipeline类，专门用来处理item类型对象。在管道类的process_item中要将其接受到的item对象中存储的数据进行持久化存储操作

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html# useful for handling different item types with a single interface
from itemadapter import ItemAdapterclass FirstPipeline:fp = None# 重写父类的一个方法：该方法只在开始爬虫的时候被调用一次def open_spider(self, spider):print('开启爬虫')self.fp = open('./a.txt', 'w', encoding='utf-8')# 用来处理item类型对象# 该方法可以接收爬虫文件提交过来的item对象# 该方法每接收到一个人item就会被调用一次# 如果把打开关闭文件放这，则需要多次打开关闭文件，则另外创建函数，减少打开关闭文件次数def process_item(self, item, spider): # item为item对象title = item['title']content = item['content']self.fp.write(title+':'+content+'\n')return itemdef close_spider(self, spider):print('结束爬虫')self.fp.close()

在配置文件setting.py中开启管道，即取消管道的注释

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {"first.pipelines.FirstPipeline": 300,
}

最后在主函数main.py中添加调库及将解析的数据存储到item类型的对象和将item提交给管道的代码

import scrapy
from first.items import FirstItemclass MainSpider(scrapy.Spider):name = "main"# allowed_domains = ["www.xxx.com"]start_urls = ["https://www.gushiwen.cn/"]def parse(self, response):# 解析数据# xpath返回的是列表，列表元素一定是Selector类型的对象div_list = response.xpath('/html/body/div[2]/div[1]/div')for div in div_list:# extract可以将selector对象中data参数存储的字符串提取出来(可以是列表)# 解析诗歌标题和内容# 加个if判断，如果解析到的不为空，就进行存储if len(div.xpath('./div[1]/p[1]/a/b/text()')) != 0 and len(div.xpath('./div[1]/div[2]/text()')) !=0:title = div.xpath('./div[1]/p[1]/a/b/text()')[0].extract()content = div.xpath('./div[1]/div[2]/text()')[0].extract()# 将解析的数据存储到item类型的对象item = FirstItem()item['title'] = titleitem['content'] = content# 将item提交给管道，进行数据存储yield item

最后运行程序：scrapy crawl main

3、数据同时保存至本地及数据库

举例：将爬取的数据一份存到本地，一份存到数据库
在上面代码的基础上，再在管道中定义多个管道类，一个类制定存储到某个地方（一个存储到本地，一个存储到数据库）
在piplines.py中新增如下类

# 导入数据库
# import pymysql
# 管道文件中一个管道类对应将一组数据存储到一个平台或者载体中
class mysqlPileLine(object):conn = Nonecursor = Nonedef open_spider(self, spider):self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='123456', db='qiubai', charset='utf8')def process_item(self, item, spider):self.cursor =self.conn.cursor()try:self.cursor.execute('insert into qiubai values("%s","%s")'%(item["title"],item["content"]))self.conn.commit()except Exception as e:pritn(e)self.conn.rollback()return itemdef close_spider(self, spider):self.cursor.close()self.conn.close()

在settings.py中开启管道，添加个优先级

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {# 300表示优先级，数值越小，优先级越高"first.pipelines.FirstPipeline": 300,"first.pipelines.mysqlPileLine": 301,
}

爬虫文件提交的item类型的对象最终会提交给先执行的类（优先级高的类）。在优先级高的里面添加return item，return就会传递给下一个即将执行的管道类，pipelines.py优先级高的类中给一个函数添加return

    def process_item(self, item, spider): # item为item对象title = item['title']content = item['content']self.fp.write(title+':'+content+'\n')# return就会传递给下一个即将执行的管道类return item

管道文件中一个管道类对应的是将数据存储到一个平台。爬虫文件提交的item只会给管道文件中第一个被执行的管道类接收（优先级高的）。process_item中的return item表示将item传递给下一个即将被执行的管道类

4、基于spider爬取某网站各页面数据

以某网站为例，对某网站的全部页面（有分页栏）对应的页面进行爬取，这里需要获取所有页面的URL，并将其加入到start_url中，然后进行请求发送

首先创建项目并初始化操作

1、打开pycharm的terminal
2、scrapy startproject first
3、cd first
4、scrapy genspider main www.xxx.com
5、修改settings.py里面的ROBOTSTXT_OBEY = True改为False并添加LOG_LEVEL='ERROR'
6、scrapy crawl main  (最后一步运行）

然后编辑main.py文件

import scrapyclass MainSpider(scrapy.Spider):name = "main"# allowed_domains = ["www.xxx.com"]# 第一个页面start_urls = ["https://www.woyaogexing.com/touxiang/katong/new/index.html"]# 生成一个通用的url# %d可以被page_num取代url = 'https://www.woyaogexing.com/touxiang/katong/new/index_%d.html'# 页码数page_num = 2def parse(self, response):# 解析数据div_list = response.xpath('/html/body/div[3]/div[3]/div[1]/div[2]/div')for li in div_list:img_name = li.xpath('./a[2]/text()')[0].extract()print(img_name)# 递归if self.page_num <= 5:  # 前5页# 每个页码对应的url，page_num替换url中的%dnew_url = format(self.url%self.page_num)self.page_num += 1# 手动请求发送：callback回调函数是专门用于数据解析yield scrapy.Request(url=new_url, callback=self.parse)

5、爬取本页和详情页信息（请求传参）

爬取本页面信息，然后再爬取详情页面信息，即本页面点击进入详情页之后爬取详细信息，这需要解析当前页获取详情页URL，然后对详情页URL发起请求并解析，然后存储。在scrapy.Request中传入meta={‘item’:item}参数，即传入item

import scrapy
from first.items import FirstItemclass MainSpider(scrapy.Spider):name = "main"# allowed_domains = ["www.baidu.com"]# 第一页面URLstart_urls = ["https://www.qidian.com/rank/yuepiao/"]# 后续页面URLurl = 'https://www.qidian.com/rank/yuepiao/year2023-month03-page%d/'page_num = 2# 回调函数接收item，解析详情页def parse_detail(self,response):item = response.meta['item']# 这里解析详情页信息content = response.xpath('...')item['content'] = contentprint(item['title'], item['content'])yield item# 解析首页的岗位名称def parse(self, response):li_list = response.xpath('/html/body/div[1]/div[6]/div[2]/div[2]/div/div/ul/li')for li in li_list:item = FirstItem()title = li.xpath('./div[2]/h2/a/text()')[0].extract()item['title'] = titledetail_url = 'https:' + li.xpath('./div[2]/h2/a/@href')[0].extract()# 对详情页发送请求获取详情页的信息源码数据# 手动请求的发送# 请求传参：meta={}，可以将meta字典传递给请求对应的回调函数yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})# 分页操作，爬取前5页数据if self.page_num <= 5:new_url = format(self.url%self.page_num)self.page_num += 1yield scrapy.Request(new_url, callback=self.parse)

6、图片数据爬取ImagesPipeline

基于scrapy爬取字符串类型的数据后可以直接提交给管道进行存储，但爬取图片需要解析图片的src，然后单独对图片的地址发起请求后去图片二进制类型的数据。pipelines.py里面的类只需要将img的src的属性值进行解析，提交到管道，管道就会对图片的src进行请求发送获取图片的二进制类型的数据，且还会进行持久化存储

在配置文件中添加文件存储目录，修改settings.py

#修改
ITEM_PIPELINES = {# 300表示优先级，数值越小，优先级越高"first.pipelines.imgPipeline": 300,
}# 在配置文件末尾添加文件存储目录
IMAGES_STORE = './imgs'

将pipelines.py文件内容全部删除，修改为如下内容

from scrapy.pipelines.images import ImagesPipeline
import scrapyclass imgPipeLine(ImagesPipeline):# 可以根据图片地址进行图片数据的请求def get_media_requests(self, item, info):yield scrapy.Request(item['src'])# 指定图片存储的路径def file_path(self, request, response=None, info=None, *, item=None):imgName = request.url.split('/')[-1]return imgNamedef item_completed(self, results, item, info):# 返回给下一个即将被执行的管道类return item

main.py内容如下

import scrapy
from first.items import FirstItemclass MainSpider(scrapy.Spider):name = "main"# allowed_domains = ["www.baidu.com"]start_urls = ["..."]def parse(self, response):li_list = response.xpath('...')for li in li_list:item = FirstItem()src = li.xpath('...').extract_first()# 图片地址item['src'] = srcyield item

五、中间件

下载中间件在引警和下载器之间，批量拦截到整个工程中所有的请求和响应
拦截请求作用：UA伪装（在process_request函数中），代理IP（在process_exception函数中，这里一定需要return request将修正之后的请求对象进行重写的请求发送）
拦截响应作用：篡改响应数据，响应对象

1、拦截请求中间件（UA伪装，代理IP）

1、打开pycharm的terminal
2、scrapy startproject first
3、cd first
4、scrapy genspider main www.xxx.com
5、修改settings.py里面的ROBOTSTXT_OBEY = True改为False并添加LOG_LEVEL='ERROR'
6、scrapy crawl main  (最后一步运行）

项目中的middlewares.py就是对应的中间件，MiddleproSpiderMiddleware类对应的就是爬虫中间件，这里不用可以删除，用的是下载中间件，即MiddleproDownloaderMiddleware类。MiddleproDownloaderMiddleware类中重点的是三个函数，即process_request（拦截请求），process_response（拦截所有响应），process_exception（拦截发送异常请求）

middlewares.py文件内容如下

class MiddleproDownloaderMiddleware:# Not all methods need to be defined. If a method is not defined,# scrapy acts as if the downloader middleware does not modify the# passed objects.@classmethoddef from_crawler(cls, crawler):# This method is used by Scrapy to create your spiders.s = cls()crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)return s# UA池user_agent_list = ["Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 ""(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1","Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 ""(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 ""(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6","Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 ""(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6","Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 ""(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1","Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 ""(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5","Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 ""(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 ""(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3","Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 ""(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3","Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 ""(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3","Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 ""(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 ""(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3","Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 ""(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 ""(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3","Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 ""(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3","Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 ""(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3","Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 ""(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24","Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 ""(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"]# ipPROXY_http = ['153.180.102.104:80','195.208.131.189:56055']PROXY_https = ['120.83.49.90:9000', '95.189.112.214:35508']# 拦截请求def process_request(self, request, spider):# UA伪装request.headers['User-Agent'] = random.choice(self.user_agent_list)return None# 拦截所有响应def process_response(self, request, response, spider):# Called with the response returned from the downloader.# Must either;# - return a Response object# - return a Request object# - or raise IgnoreRequestreturn response# 拦截发送异常请求def process_exception(self, request, exception, spider):# 当请求IP被禁用，爬取失败，进入到这里# 代理if request.url.split(':')[0] == 'http':request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)else:request.meta['proxy'] = 'http://' + random.choice(self.PROXY_https)# 将修正之后的请求对象进行重新的请求发送return requestdef spider_opened(self, spider):spider.logger.info("Spider opened: %s" % spider.name)

将settings.py中的如下内容的注释取消

# Obey robots.txt rules
ROBOTSTXT_OBEY = False# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {"middlePro.middlewares.MiddleproDownloaderMiddleware": 543,
}

main.py主函数内容如下：

import scrapyclass MiddleSpider(scrapy.Spider):# 爬取百度（搜索ip）name = "middle"# allowed_domains = ["www.xxx.com"]start_urls = ["https://www.baidu.com/s?wd=ip"]def parse(self, response):page_text = response.textwith open('./ip.html', 'w', encoding='utf-8') as fp:fp.write(page_text)

2、拦截响应中间件(动态加载)

这里还是用的下载中间件，用来篡改响应数据，响应对象。以爬取某页面数据（标题和内容）为例，通首页解析一些标题栏的板块对应详情页的url（没有动态加载），每一个板块对应的标题都是动态加载出来的（动态加载），通过解析出每一条新闻详情页的url获取详情页的页面源码，解析出其详细的内容

首先创建项目

1、打开pycharm的terminal
2、scrapy startproject wangyipro
3、cd wangyipro
4、scrapy genspider main www.xxx.com
5、修改settings.py里面的ROBOTSTXT_OBEY = True改为False并添加LOG_LEVEL='ERROR'
6、scrapy crawl main  (最后一步运行）

注：标题栏每个标题栏里面固定位子的解析方式可能不一样，可能不能解析到正确的内容

settings.py取消如下注释

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {"wangyiPro.middlewares.WangyiproDownloaderMiddleware": 543,
}# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {"wangyiPro.pipelines.WangyiproPipeline": 300,
}

修改items.py

import scrapyclass WangyiproItem(scrapy.Item):# define the fields for your item here like:title = scrapy.Field()content = scrapy.Field()pass

在pipelines.py添加输出

class WangyiproPipeline:# 这里存储数据def process_item(self, item, spider):print(item)return item

修改middlewares.py内容

from scrapy.http import HtmlResponse
from time import sleepclass WangyiproDownloaderMiddleware:# Not all methods need to be defined. If a method is not defined,# scrapy acts as if the downloader middleware does not modify the# passed objects.@classmethoddef from_crawler(cls, crawler):# This method is used by Scrapy to create your spiders.s = cls()crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)return sdef process_request(self, request, spider):# Called for each request that goes through the downloader# middleware.# Must either:# - return None: continue processing this request# - or return a Response object# - or return a Request object# - or raise IgnoreRequest: process_exception() methods of#   installed downloader middleware will be calledreturn None# 通过该方法拦截各特定板块对应的响应对象，进行篡改def process_response(self, request, response, spider): # spider爬虫对象# 获取爬虫类转给你定义的浏览器对象bro = spider.bro# 挑选出指定的响应对象进行篡改# 通过url指定让request# 通过request指定responseif request.url in spider.models_urls:# 指定的个板块对应的url进行请求bro.get(response.url)sleep(2)# 包含了动态加载的新闻数据page_text = bro.page_source# response # 特定板块对应的响应对象# 针对定位到的这些response进行篡改# 实例化一个新的响应对象（符合需求：包含动态加载出的新闻数据），替代原来旧的响应对象# 基于selenium便捷的获取动态加载数据new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)return new_responseelse:# response # 其他请求对应的响应对象return responsedef process_exception(self, request, exception, spider):# Called when a download handler or a process_request()# (from other downloader middleware) raises an exception.# Must either:# - return None: continue processing this exception# - return a Response object: stops process_exception() chain# - return a Request object: stops process_exception() chainpassdef spider_opened(self, spider):spider.logger.info("Spider opened: %s" % spider.name)

编写主函数main.py

import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItemclass WangyiSpider(scrapy.Spider):name = "wangyi"# allowed_domains = ["www.xxx.com"]start_urls = ["..."]# 存储各板块urlmodels_url = []# 解析各板块对应详情页的url# 实例化浏览器对象def __init__(self):self.bro = webdriver.Chrome(executable_path='')def parse(self, response):li_list = response.xpath('...')# 解析获取每个标题栏的URL，并存储到列表中for li in li_list:model_url = li.xpath('...')[0].extract()self.models_url.append(model_url)# 依次对每个标题栏对应的页面进行请求for url in self.models_url:# 对每个板块的url进行请求发送yield scrapy.Request(url, callback=self.parse_model)# 每一个标题栏对应的内容标题相关的内容都是动态加载# 解析每一个板块页面中对应新闻的标题和新闻详情页的urldef parse_model(self, response):# 因为这里内容是动态加载出来的，所以用通常的方法response.xpath()是抓取不到的，需要在middleware.py文件中的process_response函数中编辑# 标题栏每个标题栏里面固定位子的解析方式可能不一样，需要分析每个页面的解析方式，可以根据属性值匹配div_list = response.xpath('...')for div in div_list:title = div.xpath('').extract_first()new_detail_url = div.xpath('').extract_first()item = WangyiproItem()item['title'] = title# 对新闻详情页的url发起请求yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item':item})# 解析新闻内容def parse_detail(self, response):content = response.xpath('').extract()content = ''.join(content)item = response.mate['item']item['content'] = contentyield itemdef close(self, spider):self.bro.quit()

六、CrawlSpider（自动请求全站爬取，全部页面，自动下拉滚轮爬取）

可以提取页面显示栏中显示及未显示页面的所有页码链接等信息

CrawlSpider是Spider的一个子类，和Spider（手动请求）一样可以爬取全站数据
链接提取器：根据指定规则（参数allow=“正则”）进行指定链接的提取
规则解析器：将链接提取器提取到的链接进行指定规则(callback)的解析操作

爬取全站数据：爬取的数据没有在同一个页面，多个页码
1.可以使用链接提取器提取所有的页码链接
2.让链接提取器提取所有的详情页链接

CrawlSpider的使用（加 -t crawl）：

1、打开pycharm的terminal
2、scrapy startproject first
3、cd first
4、scrapy genspider -t crawl main www.xxx.com
5、修改settings.py里面的ROBOTSTXT_OBEY = True改为False并添加LOG_LEVEL='ERROR'
6、scrapy crawl main  (最后一步运行）

items.py创建两个item类，

import scrapyclass SunproItem(scrapy.Item):# define the fields for your item here like:title = scrapy.Field()new_num = scrapy.Field()class DetailItem(scrapy.Item):new_id = scrapy.Field()content = scrapy.Field()

pipelines.py

class SunproPipeline:def process_item(self, item, spider):# 如何判定item类型if item.__class__.__name__ == 'DetailItem':print(item['new_id'],item['new_content'])else:print(item['new_num'],item['new_title'])return item

sun.py主函数

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunPro.item import SunproItem,DetailItem# 需求爬取某网站的所有标签页的所有内容
class SunSpider(CrawlSpider):name = "sun"# allowed_domains = ["www.xxx.com"]start_urls = ["http://www.xxx.com/"]# 链接提取器：根据指定规则（allow="正则"）进行指定链接的提取# 提取页码链接link = LinkExtractor(allow=r"Items/")# 提取详情页链接link_detail = LinkExtractor(allow=r"...") ## Rule为规则解析器：将链接提取器提取到的链接进行指定规则(callback)的解析操作# Rule参数：链接提取器# follow改为True可以将链接提取器 继续作用到 链接提取器提取到的链接 所对应的页面中# follow为True可以提取下面显示及未显示页面的所有页码链接# follow为False只能提取下面显示的几个页面的链接# Rule中的callback调用对应parse_item函数rules = (Rule(link, callback="parse_item", follow=True),  # 提取页面链接Rule(link_detail, callback='parse_detail'))      # 提取详情页链接# 如下两个解析方法中是不可以实现请求传参# 如果将两个解析方法解析的数据存储到同一个item中，可以依次存储到两个item中，在items.py文件中建两个item类def parse_item(self, response):# xpath表达式中不可以出现tbgodybiao标签tr_list = response.xpath('...')for tr in tr_list:new_num = tr.xpath('...').extract_first()new_title = tr.xpath('...').extract_first()item = SunproItem()item['new_num'] = new_numitem['new_title'] = new_title# 解析详情页内容def parse_detail(self, response):new_id = response.xpath('...')new_content = response.xpath('...')item = DetailItem()item['new_id'] = new_iditem['new_content'] = new_content

七、分布式爬虫

搭建一个分布式的机群（多台电脑），当其中一台发出请求（多个url需要爬取），其他机器会一起爬取数据，提高效率

概念：需要搭建一个分布式的机群，让其对一组资源进行分布式联合爬取
作用：提升爬取数据效率

安装scrapy-redis组件pip install scrapy-redis
原生的scrapy不可以实现分布式爬虫，必须让scrapy组合这scrapy-redis组件一起实现分布式爬虫
原生的scrapy不可以实现分布式爬虫是因为调度器和管道都不可以被分布式机群共享
scrapy-redis组件可以给原生的scrapy框架提供可以被共享的管道和调度器

实现流程
创建一个工程
创建一个基于CrawlSpider的爬虫文件
修改当前的爬虫文件：

导包：from scrapy_redis.spoders import RedisCrawlSpider
将start_urls和allowed_domains进行注释
添加一个属性：redis_key = 'sun' 可以被空闲的调度器队列名称
编写数据解析相关的操作
将当前爬虫类的父类修改成RedisCrawlSpider

主函数

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rulefrom scrapy_redis.spiders import RedisCrawlSpider# 需求爬取某网站的所有标签页的所有内容
class SunSpider(RedisCrawlSpider):name = "sun"# allowed_domains = ["www.xxx.com"]# start_urls = ["http://www.xxx.com/"]redis_key = 'sun'link = LinkExtractor(allow=r"Items/")link_detail = LinkExtractor(allow=r"...")rules = (Rule(link, callback="parse_item", follow=True),Rule(link_detail, callback='parse_detail'))def parse_item(self, response):# xpath表达式中不可以出现tbgodybiao标签tr_list = response.xpath('...')for tr in tr_list:new_num = tr.xpath('...').extract_first()new_title = tr.xpath('...').extract_first()item = SunproItem()item['new_num'] = new_numitem['new_title'] = new_titleyield item

修改配置文件settings，除了和上面常规的修改，还有添加如下内容

# 指定使用可以被共享的管道
ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline':400
}
# 指定调度器
# 增加一个去重容器类的配置，作用使用Redis的set集合来存储请求的指纹数据，从而实现请求取重的持久化存储
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 使用scrapy_redis组件自己的调度器
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# 配置调度器是否要持久化，也就是当爬虫结束了，要不要清空Redis中请求队列和去重指纹的set，如果True
SCHEDULER_PERSIST = True# 指定redis服务器
REDIS_HOST = 'redis服务器ip地址'  # 写成redis远程服务器的ip
REDIS_PORT = 6379

redis相关操作配置：

配置redis配置文件linux或mac：redis.confwindows：redis.windows.conf打开配置文件修改：将bind 127.0.0.1注释关闭保护模式：protected-mode yes改为no
结合着配置文件开启redis服务redis-server 配置文件
启动客户端redis-cli

执行工程：scrapy runspider xxx.py
向调度器的队列中放入一个起始的url

调度器的队列在redis的客户端中lpush xxx www.xxx.com

爬取到的数据存储在redis的proName:items这个数据结构中

八、增量式爬虫

比如某个网站一定时间会更新一部分内容，有些不会更新，今天我们爬取了网站的所有内容，明天再爬取的时候，我们只需要爬取比昨天新增的内容，原先的不用再爬取，这就是增量式爬虫（如下核心部分）

检测网站数据更新的情况，只会爬取网站最新更新出来的数据
分析：

指定起始url：www.4567tv.tv
基于CrawlSpider获取其他页码链接
基于Rule将其他页码链接进行请求
从每一个页码对应的页面源码中解析出每一个电影详情页的URL
核心：检测电影详情页的url之前有没有请求过将爬取过的电影详情页的URL存储存储到redis的set数据结构（自动清楚重复数据，即存在过添加不进去，返回1表示不存在可以添加，返回0表示存在不添加）
对详情页的url发起请求，然后解析出电影的名称和简介
进行持久化存储