scrapy_redis package¶
Submodules¶
scrapy_redis.connection module¶

scrapy_redis.connection.from_settings(settings)¶
Returns a redis client instance from a given Scrapy settings object.

This function uses get_redis to instantiate the client and uses the defaults.REDIS_PARAMS global as default values for the parameters. You can override them using the REDIS_PARAMS setting.

Parameters: settings (Settings) – A Scrapy settings object. See the supported settings below.
Returns: Redis client instance.
Other Parameters:
- REDIS_URL (str, optional) – Server connection URL.
- REDIS_HOST (str, optional) – Server host.
- REDIS_PORT (str, optional) – Server port.
- REDIS_ENCODING (str, optional) – Data encoding.
- REDIS_PARAMS (dict, optional) – Additional client parameters.
scrapy_redis.connection.get_redis(**kwargs)[source]¶
Returns a redis client instance.

Parameters:
- redis_cls (class, optional) – Defaults to redis.StrictRedis.
- url (str, optional) – If given, redis_cls.from_url is used to instantiate the class.
- **kwargs – Extra parameters to be passed to the redis_cls class.
Returns: Redis client instance.
scrapy_redis.connection.get_redis_from_settings(settings)[source]¶
Returns a redis client instance from a given Scrapy settings object.

This function uses get_redis to instantiate the client and uses the defaults.REDIS_PARAMS global as default values for the parameters. You can override them using the REDIS_PARAMS setting.

Parameters: settings (Settings) – A Scrapy settings object. See the supported settings below.
Returns: Redis client instance.
Other Parameters:
- REDIS_URL (str, optional) – Server connection URL.
- REDIS_HOST (str, optional) – Server host.
- REDIS_PORT (str, optional) – Server port.
- REDIS_ENCODING (str, optional) – Data encoding.
- REDIS_PARAMS (dict, optional) – Additional client parameters.
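As a sketch, the supported settings above could be configured in a project's settings.py like this (the URL, host, and parameter values are illustrative, not required defaults):

```python
# settings.py -- illustrative values for the redis connection settings.
# REDIS_URL takes precedence: when it is set, redis_cls.from_url is used
# and REDIS_HOST/REDIS_PORT are ignored.
REDIS_URL = 'redis://localhost:6379/0'

# Alternatively, host and port can be given separately:
# REDIS_HOST = 'localhost'
# REDIS_PORT = 6379

REDIS_ENCODING = 'utf-8'
# Extra keyword arguments forwarded to the redis client class:
REDIS_PARAMS = {'socket_timeout': 30}
```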
scrapy_redis.dupefilter module¶

class scrapy_redis.dupefilter.RFPDupeFilter(server, key, debug=False)[source]¶
Bases: scrapy.dupefilters.BaseDupeFilter

Redis-based request duplicates filter.

This class can also be used with Scrapy's default scheduler.

close(reason='')[source]¶
Delete data on close. Called by Scrapy's scheduler.
Parameters: reason (str, optional) –

classmethod from_crawler(crawler)[source]¶
Returns an instance from a crawler.
Parameters: crawler (scrapy.crawler.Crawler) –
Returns: Instance of RFPDupeFilter.
Return type: RFPDupeFilter

classmethod from_settings(settings)[source]¶
Returns an instance from given settings.
By default, this uses the key dupefilter:<timestamp>. When using the scrapy_redis.scheduler.Scheduler class, this method is not used, as the scheduler needs to pass the spider name in the key.
Parameters: settings (scrapy.settings.Settings) –
Returns: A RFPDupeFilter instance.
Return type: RFPDupeFilter

log(request, spider)[source]¶
Logs given request.
Parameters:
- request (scrapy.http.Request) –
- spider (scrapy.spiders.Spider) –

logger = <logging.Logger object>¶
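As a minimal sketch, the filter can be plugged into Scrapy's default scheduler through the standard dupefilter settings (the connection URL below is illustrative):

```python
# settings.py -- sketch enabling the Redis-based dupefilter with
# Scrapy's default scheduler.
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
REDIS_URL = 'redis://localhost:6379/0'
# Scrapy's standard DUPEFILTER_DEBUG setting maps to the ``debug`` argument,
# logging every duplicate request instead of only the first:
DUPEFILTER_DEBUG = True
```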
scrapy_redis.pipelines module¶

class scrapy_redis.pipelines.RedisPipeline(server, key='%(spider)s:items', serialize_func=ScrapyJSONEncoder().encode)[source]¶
Bases: object

Pushes serialized items into a redis list/queue.

Settings:
- REDIS_ITEMS_KEY : str – Redis key where to store items.
- REDIS_ITEMS_SERIALIZER : str – Object path to serializer function.
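A sketch of enabling the pipeline in a project (the priority value and the serializer path are illustrative choices, not requirements):

```python
# settings.py -- sketch enabling RedisPipeline.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
# Optional overrides for the settings documented above:
REDIS_ITEMS_KEY = '%(spider)s:items'   # the documented default key template
REDIS_ITEMS_SERIALIZER = 'json.dumps'  # object path to any serializer function
```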
scrapy_redis.queue module¶

class scrapy_redis.queue.Base(server, spider, key, serializer=None)[source]¶
Bases: object
Per-spider base queue class.

class scrapy_redis.queue.FifoQueue(server, spider, key, serializer=None)[source]¶
Bases: scrapy_redis.queue.Base
Per-spider FIFO queue.

class scrapy_redis.queue.LifoQueue(server, spider, key, serializer=None)[source]¶
Bases: scrapy_redis.queue.Base
Per-spider LIFO queue.

class scrapy_redis.queue.PriorityQueue(server, spider, key, serializer=None)[source]¶
Bases: scrapy_redis.queue.Base
Per-spider priority queue abstraction using redis' sorted set.

scrapy_redis.queue.SpiderPriorityQueue¶
Alias of PriorityQueue.
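The queue classes above are normally selected through the scheduler's queue-class setting; a sketch, with exactly one line active in a real project:

```python
# settings.py -- sketch selecting the queue class for the Redis scheduler.
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # the default
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'    # FIFO ordering
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'    # LIFO ordering
```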
scrapy_redis.scheduler module¶

class scrapy_redis.scheduler.Scheduler(server, persist=False, flush_on_start=False, queue_key='%(spider)s:requests', queue_cls='scrapy_redis.queue.PriorityQueue', dupefilter_key='%(spider)s:dupefilter', dupefilter_cls='scrapy_redis.dupefilter.RFPDupeFilter', idle_before_close=0, serializer=None)[source]¶
Bases: object

Redis-based scheduler.

Settings:
- SCHEDULER_PERSIST : bool (default: False) – Whether to persist or clear the redis queue.
- SCHEDULER_FLUSH_ON_START : bool (default: False) – Whether to flush the redis queue on start.
- SCHEDULER_IDLE_BEFORE_CLOSE : int (default: 0) – How many seconds to wait before closing if no message is received.
- SCHEDULER_QUEUE_KEY : str – Scheduler redis key.
- SCHEDULER_QUEUE_CLASS : str – Scheduler queue class.
- SCHEDULER_DUPEFILTER_KEY : str – Scheduler dupefilter redis key.
- SCHEDULER_DUPEFILTER_CLASS : str – Scheduler dupefilter class.
- SCHEDULER_SERIALIZER : str – Scheduler serializer.
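Putting the settings above together, a sketch of enabling the Redis scheduler; the values shown mirror the documented defaults:

```python
# settings.py -- sketch enabling the Redis scheduler (values are the
# documented defaults, written out for illustration).
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
SCHEDULER_PERSIST = False          # clear the queue when the spider closes
SCHEDULER_FLUSH_ON_START = False   # keep any pending requests on start
SCHEDULER_IDLE_BEFORE_CLOSE = 0    # do not wait for new messages on close
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
```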
scrapy_redis.spiders module¶

class scrapy_redis.spiders.RedisCrawlSpider(*a, **kw)[source]¶
Bases: scrapy_redis.spiders.RedisMixin, scrapy.spiders.crawl.CrawlSpider

Spider that reads urls from a redis queue when idle.

- redis_key¶ – str (default: REDIS_START_URLS_KEY) – Redis key where to fetch start URLs from.
- redis_batch_size¶ – int (default: CONCURRENT_REQUESTS) – Number of messages to fetch from redis on each attempt.
- redis_encoding¶ – str (default: REDIS_ENCODING) – Encoding to use when decoding messages from the redis queue.

Settings:
- REDIS_START_URLS_KEY¶ – str (default: "<spider.name>:start_urls") – Default Redis key where to fetch start URLs from.
- REDIS_START_URLS_BATCH_SIZE¶ – int (deprecated by CONCURRENT_REQUESTS) – Default number of messages to fetch from redis on each attempt.
- REDIS_START_URLS_AS_SET¶ – bool (default: True) – Use SET operations to retrieve messages from the redis queue.
- REDIS_ENCODING¶ – str (default: "utf-8") – Default encoding to use when decoding messages from the redis queue.
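A sketch of the start-URL settings in a project's settings.py (the key template mirrors the documented default; the other values are illustrative):

```python
# settings.py -- sketch of the start-URL settings documented above.
REDIS_START_URLS_KEY = '%(name)s:start_urls'  # expands to "<spider.name>:start_urls"
REDIS_START_URLS_AS_SET = False               # pop from a list; True uses a redis SET
REDIS_ENCODING = 'utf-8'
```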
class scrapy_redis.spiders.RedisMixin[source]¶
Bases: object

Mixin class to implement reading urls from a redis queue.

make_request_from_data(data)[source]¶
Returns a Request instance from data coming from Redis.
By default, data is an encoded URL. You can override this method to provide your own message decoding.
Parameters: data (bytes) – Message from redis.

redis_batch_size = None¶
redis_encoding = None¶
redis_key = None¶
server = None¶
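To accept richer messages than a bare URL, a subclass can override make_request_from_data. As a minimal sketch, the decoding step might look like the function below; the JSON message format and its field names are assumptions for illustration, not part of scrapy_redis:

```python
import json

def decode_message(data, encoding='utf-8'):
    """Decode a JSON message popped from redis into (url, meta).

    In an actual spider this logic would live inside an overridden
    make_request_from_data, which would wrap the result in a
    scrapy.http.Request before returning it.
    """
    payload = json.loads(data.decode(encoding))
    return payload['url'], payload.get('meta', {})

url, meta = decode_message(b'{"url": "http://example.com", "meta": {"depth": 2}}')
```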
class scrapy_redis.spiders.RedisSpider(name=None, **kwargs)[source]¶
Bases: scrapy_redis.spiders.RedisMixin, scrapy.spiders.Spider

Spider that reads urls from a redis queue when idle.

- redis_key¶ – str (default: REDIS_START_URLS_KEY) – Redis key where to fetch start URLs from.
- redis_batch_size¶ – int (default: CONCURRENT_REQUESTS) – Number of messages to fetch from redis on each attempt.
- redis_encoding¶ – str (default: REDIS_ENCODING) – Encoding to use when decoding messages from the redis queue.

Settings:
- REDIS_START_URLS_KEY¶ – str (default: "<spider.name>:start_urls") – Default Redis key where to fetch start URLs from.
- REDIS_START_URLS_BATCH_SIZE¶ – int (deprecated by CONCURRENT_REQUESTS) – Default number of messages to fetch from redis on each attempt.
- REDIS_START_URLS_AS_SET¶ – bool (default: False) – Use SET operations to retrieve messages from the redis queue. If False, the messages are retrieved using the LPOP command.
- REDIS_ENCODING¶ – str (default: "utf-8") – Default encoding to use when decoding messages from the redis queue.
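Once a RedisSpider subclass is running, the crawl is seeded by pushing URLs to its redis key. Assuming a spider named myspider (a hypothetical name) and the default list-based queue, a session might look like:

```shell
# Push a start URL onto the spider's list (popped with LPOP by default):
redis-cli lpush myspider:start_urls http://example.com

# With REDIS_START_URLS_AS_SET = True, add to a set instead:
# redis-cli sadd myspider:start_urls http://example.com
```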