scrapy_redis package

Submodules

scrapy_redis.connection module

scrapy_redis.connection.from_settings(settings)

alias of get_redis_from_settings
scrapy_redis.connection.get_redis(**kwargs)[source]

Returns a redis client instance.

Parameters:
  • redis_cls (class, optional) – Defaults to redis.StrictRedis.
  • url (str, optional) – If given, redis_cls.from_url is used to instantiate the class.
  • **kwargs – Extra parameters to be passed to the redis_cls class.
Returns:

Redis client instance.

Return type:

redis.StrictRedis (or the class passed as redis_cls)
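
For illustration, a minimal sketch of both call styles, assuming a Redis server reachable at localhost:6379:

    from scrapy_redis.connection import get_redis

    # Connect via URL: get_redis delegates to redis_cls.from_url.
    server = get_redis(url='redis://localhost:6379/0')

    # Or pass client parameters directly; extra kwargs are forwarded to
    # redis.StrictRedis, the default redis_cls.
    server = get_redis(host='localhost', port=6379, db=0)
    server.ping()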

scrapy_redis.connection.get_redis_from_settings(settings)[source]

Returns a redis client instance from given Scrapy settings object.

This function uses get_redis to instantiate the client and uses the defaults.REDIS_PARAMS global as default values for the parameters. You can override them using the REDIS_PARAMS setting.

Parameters:

settings (Settings) – A scrapy settings object. See the supported settings below.

Returns:

Redis client instance.

Return type:

redis.StrictRedis (or the redis_cls given in REDIS_PARAMS)

Other Parameters:
 
  • REDIS_URL (str, optional) – Server connection URL.
  • REDIS_HOST (str, optional) – Server host.
  • REDIS_PORT (str, optional) – Server port.
  • REDIS_ENCODING (str, optional) – Data encoding.
  • REDIS_PARAMS (dict, optional) – Additional client parameters.
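
For example, a sketch of building a client from a settings object; the socket_timeout entry is illustrative, and REDIS_PARAMS values are merged over the defaults.REDIS_PARAMS global:

    from scrapy.settings import Settings
    from scrapy_redis.connection import get_redis_from_settings

    settings = Settings({
        'REDIS_URL': 'redis://localhost:6379/0',
        # Extra keyword arguments passed to the redis client class.
        'REDIS_PARAMS': {'socket_timeout': 30},
    })
    server = get_redis_from_settings(settings)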

scrapy_redis.dupefilter module

class scrapy_redis.dupefilter.RFPDupeFilter(server, key, debug=False)[source]

Bases: scrapy.dupefilters.BaseDupeFilter

Redis-based request duplicates filter.

This class can also be used with Scrapy's default scheduler.

clear()[source]

Clears fingerprints data.

close(reason='')[source]

Delete data on close. Called by Scrapy’s scheduler.

Parameters: reason (str, optional)

classmethod from_crawler(crawler)[source]

Returns instance from crawler.

Parameters: crawler (scrapy.crawler.Crawler)
Returns: Instance of RFPDupeFilter.
Return type: RFPDupeFilter
classmethod from_settings(settings)[source]

Returns an instance from given settings.

By default, this uses the key dupefilter:<timestamp>. This method is not used when using the scrapy_redis.scheduler.Scheduler class, as the scheduler needs to include the spider name in the key.

Parameters: settings (scrapy.settings.Settings)
Returns: An RFPDupeFilter instance.
Return type: RFPDupeFilter
log(request, spider)[source]

Logs given request.

Parameters:
  • request (scrapy.http.Request) –
  • spider (scrapy.spiders.Spider) –
logger = <logging.Logger object>
request_fingerprint(request)[source]

Returns a fingerprint for a given request.

Parameters: request (scrapy.http.Request)
Returns: The request fingerprint.
Return type: str
request_seen(request)[source]

Returns True if request was already seen.

Parameters: request (scrapy.http.Request)
Returns: Whether the request was already seen.
Return type: bool
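
To use this filter with Scrapy's default scheduler, point DUPEFILTER_CLASS at it:

    # settings.py: redis-based dupefilter with the stock scheduler.
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
    REDIS_URL = 'redis://localhost:6379/0'

A sketch of direct use, where the key name myspider:dupefilter is illustrative:

    from scrapy import Request
    from scrapy_redis.connection import get_redis
    from scrapy_redis.dupefilter import RFPDupeFilter

    server = get_redis(url='redis://localhost:6379/0')
    df = RFPDupeFilter(server, key='myspider:dupefilter')

    request = Request('https://example.com')
    df.request_seen(request)  # False: fingerprint stored for the first time
    df.request_seen(request)  # True: duplicate
    df.clear()                # drop all stored fingerprints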

scrapy_redis.pipelines module

class scrapy_redis.pipelines.RedisPipeline(server, key='%(spider)s:items', serialize_func=<bound method ScrapyJSONEncoder.encode of <scrapy.utils.serialize.ScrapyJSONEncoder object>>)[source]

Bases: object

Pushes serialized items into a redis list/queue.

REDIS_ITEMS_KEY : str
Redis key where to store items.
REDIS_ITEMS_SERIALIZER : str
Object path to serializer function.
classmethod from_crawler(crawler)[source]
classmethod from_settings(settings)[source]
item_key(item, spider)[source]

Returns redis key based on given spider.

Override this function to use a different key depending on the item and/or spider.

process_item(item, spider)[source]
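
For example, the pipeline is enabled through ITEM_PIPELINES, and item_key can be overridden in a subclass; the category field below is purely illustrative:

    # settings.py: push serialized items into redis.
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 300,
    }
    REDIS_ITEMS_KEY = '%(spider)s:items'   # default shown
    REDIS_ITEMS_SERIALIZER = 'json.dumps'  # object path to a serializer callable

    # A hypothetical subclass routing items to per-category keys.
    from scrapy_redis.pipelines import RedisPipeline

    class PerCategoryPipeline(RedisPipeline):

        def item_key(self, item, spider):
            return '%s:items:%s' % (spider.name, item.get('category', 'misc'))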

scrapy_redis.queue module

class scrapy_redis.queue.Base(server, spider, key, serializer=None)[source]

Bases: object

Per-spider base queue class

clear()[source]

Clear queue/stack

pop(timeout=0)[source]

Pop a request

push(request)[source]

Push a request

class scrapy_redis.queue.FifoQueue(server, spider, key, serializer=None)[source]

Bases: scrapy_redis.queue.Base

Per-spider FIFO queue

pop(timeout=0)[source]

Pop a request

push(request)[source]

Push a request

class scrapy_redis.queue.LifoQueue(server, spider, key, serializer=None)[source]

Bases: scrapy_redis.queue.Base

Per-spider LIFO queue.

pop(timeout=0)[source]

Pop a request

push(request)[source]

Push a request

class scrapy_redis.queue.PriorityQueue(server, spider, key, serializer=None)[source]

Bases: scrapy_redis.queue.Base

Per-spider priority queue abstraction using redis’ sorted set

pop(timeout=0)[source]

Pop a request. The timeout argument is not supported by this queue class.

push(request)[source]

Push a request

scrapy_redis.queue.SpiderPriorityQueue

alias of PriorityQueue

scrapy_redis.queue.SpiderQueue

alias of FifoQueue

scrapy_redis.queue.SpiderStack

alias of LifoQueue
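
The queue class is normally selected through the scheduler settings rather than instantiated directly, for example:

    # settings.py: choose the scheduler queue type by class path.
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # default
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'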

scrapy_redis.scheduler module

class scrapy_redis.scheduler.Scheduler(server, persist=False, flush_on_start=False, queue_key='%(spider)s:requests', queue_cls='scrapy_redis.queue.PriorityQueue', dupefilter_key='%(spider)s:dupefilter', dupefilter_cls='scrapy_redis.dupefilter.RFPDupeFilter', idle_before_close=0, serializer=None)[source]

Bases: object

Redis-based scheduler

SCHEDULER_PERSIST : bool (default: False)
Whether to persist or clear redis queue.
SCHEDULER_FLUSH_ON_START : bool (default: False)
Whether to flush redis queue on start.
SCHEDULER_IDLE_BEFORE_CLOSE : int (default: 0)
How many seconds to wait before closing if no message is received.
SCHEDULER_QUEUE_KEY : str
Scheduler redis key.
SCHEDULER_QUEUE_CLASS : str
Scheduler queue class.
SCHEDULER_DUPEFILTER_KEY : str
Scheduler dupefilter redis key.
SCHEDULER_DUPEFILTER_CLASS : str
Scheduler dupefilter class.
SCHEDULER_SERIALIZER : str
Scheduler serializer.
close(reason)[source]
enqueue_request(request)[source]
flush()[source]
classmethod from_crawler(crawler)[source]
classmethod from_settings(settings)[source]
has_pending_requests()[source]
next_request()[source]
open(spider)[source]
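
A typical settings.py fragment enabling this scheduler; every setting except SCHEDULER itself is optional, and the values shown are illustrative:

    # settings.py: replace Scrapy's default scheduler with the redis one.
    SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
    # Keep queue and dupefilter data in redis between runs.
    SCHEDULER_PERSIST = True
    # Or start every run from an empty queue instead:
    # SCHEDULER_FLUSH_ON_START = True
    # Seconds to wait for a message before closing an idle spider.
    SCHEDULER_IDLE_BEFORE_CLOSE = 10
    REDIS_URL = 'redis://localhost:6379/0'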

scrapy_redis.spiders module

class scrapy_redis.spiders.RedisCrawlSpider(*a, **kw)[source]

Bases: scrapy_redis.spiders.RedisMixin, scrapy.spiders.crawl.CrawlSpider

Spider that reads URLs from a redis queue when idle.

redis_key

str (default: REDIS_START_URLS_KEY) – Redis key to fetch start URLs from.

redis_batch_size

int (default: CONCURRENT_REQUESTS) – Number of messages to fetch from redis on each attempt.

redis_encoding

str (default: REDIS_ENCODING) – Encoding to use when decoding messages from redis queue.

Settings
--------
REDIS_START_URLS_KEY

str (default: “<spider.name>:start_urls”) – Default Redis key to fetch start URLs from.

REDIS_START_URLS_BATCH_SIZE

int (deprecated; use CONCURRENT_REQUESTS) – Default number of messages to fetch from redis on each attempt.

REDIS_START_URLS_AS_SET

bool (default: False) – Use SET operations to retrieve messages from the redis queue.

REDIS_ENCODING

str (default: “utf-8”) – Default encoding to use when decoding messages from redis queue.

classmethod from_crawler(crawler, *args, **kwargs)[source]
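
A minimal sketch of a crawl spider fed from redis; the spider name, key, and callback are illustrative:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule
    from scrapy_redis.spiders import RedisCrawlSpider

    class MyCrawler(RedisCrawlSpider):
        name = 'mycrawler'
        redis_key = 'mycrawler:start_urls'  # the default would be '<name>:start_urls'
        rules = (
            Rule(LinkExtractor(), callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            yield {'url': response.url}

With the spider running, redis-cli lpush mycrawler:start_urls https://example.com schedules a crawl of that URL (use SADD instead when REDIS_START_URLS_AS_SET is enabled).
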
class scrapy_redis.spiders.RedisMixin[source]

Bases: object

Mixin class to implement reading urls from a redis queue.

make_request_from_data(data)[source]

Returns a Request instance from data coming from Redis.

By default, data is an encoded URL. You can override this method to provide your own message decoding.

Parameters: data (bytes) – Message from redis.
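
As a sketch, a hypothetical override that expects JSON messages of the form {"url": ..., "meta": ...} instead of bare URLs:

    import json

    from scrapy import Request
    from scrapy_redis.spiders import RedisSpider

    class JsonSpider(RedisSpider):
        name = 'jsonspider'

        def make_request_from_data(self, data):
            # data is the raw bytes payload popped from redis.
            message = json.loads(data.decode(self.redis_encoding or 'utf-8'))
            return Request(message['url'], meta=message.get('meta', {}))
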
next_requests()[source]

Returns a request to be scheduled, or None.

redis_batch_size = None
redis_encoding = None
redis_key = None
schedule_next_requests()[source]

Schedules a request if available

server = None
setup_redis(crawler=None)[source]

Sets up the redis connection and the idle signal.

This should be called after the spider has set its crawler object.

spider_idle()[source]

Schedules a request if available, otherwise waits.

start_requests()[source]

Returns a batch of start requests from redis.

class scrapy_redis.spiders.RedisSpider(name=None, **kwargs)[source]

Bases: scrapy_redis.spiders.RedisMixin, scrapy.spiders.Spider

Spider that reads URLs from a redis queue when idle.

redis_key

str (default: REDIS_START_URLS_KEY) – Redis key to fetch start URLs from.

redis_batch_size

int (default: CONCURRENT_REQUESTS) – Number of messages to fetch from redis on each attempt.

redis_encoding

str (default: REDIS_ENCODING) – Encoding to use when decoding messages from redis queue.

Settings
--------
REDIS_START_URLS_KEY

str (default: “<spider.name>:start_urls”) – Default Redis key to fetch start URLs from.

REDIS_START_URLS_BATCH_SIZE

int (deprecated; use CONCURRENT_REQUESTS) – Default number of messages to fetch from redis on each attempt.

REDIS_START_URLS_AS_SET

bool (default: False) – Use SET operations to retrieve messages from the redis queue. If False, the messages are retrieved using the LPOP command.

REDIS_ENCODING

str (default: “utf-8”) – Default encoding to use when decoding messages from redis queue.

classmethod from_crawler(crawler, *args, **kwargs)[source]
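
For example, a minimal spider that idles until URLs are pushed to its key; the names are illustrative:

    from scrapy_redis.spiders import RedisSpider

    class MySpider(RedisSpider):
        name = 'myspider'
        redis_key = 'myspider:start_urls'

        def parse(self, response):
            yield {'url': response.url}

With the spider running, redis-cli lpush myspider:start_urls https://example.com schedules a request for that URL.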

Module contents