single-file scrapy spider (without all the default folder/project overhead)

just documentation for myself while it's still fresh in my head.

Here’s the simplest version. More details to follow (coloring terminal output, adding timestamps to a log file, explaining some of the delay and log settings, etc.).

Overview

  1. Create a class that subclasses CrawlSpider
  2. Inside, define a start_requests() method, where you take a list of urls (or a single url) and yield a scrapy.Request() that calls a function you define (e.g. self.parse_page())
    • Note the use of a base_url; this is needed because links on a page are often relative links
  3. Define the parsing function named in the callback. Most commonly you use response.css(), which takes normal css selectors – a period to select a class, ::text to select the plain text inside a set of tags, or ::attr() to select a specific attribute of the tag itself (see the short selector demo right after this list)
  4. Create a dictionary of the desired values from the page, and yield it
  5. Create a CrawlerProcess that outputs to CSV
    • c = CrawlerProcess(settings={
      "FEEDS":{ "output_posts_crawler.csv" : {"format" : "csv"}}
      })
  6. Pass the CrawlSpider defined earlier to the CrawlerProcess, then start the spider
    • c.crawl(PostSpider)
      c.start()
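
Quick standalone demo of the selector patterns from step 3, run against a made-up snippet of HTML (nothing here touches a real site), using scrapy's Selector directly:

from scrapy.selector import Selector

# made-up HTML, roughly the shape of the posts scraped below
sel = Selector(text='''
<article class="post" data-content="id-123">
  <div class="msg-author"><a href="index.php?threads/somethreadtitle/post-123">No. 2</a></div>
</article>
''')

print(sel.css('.msg-author a::text').get())               # ::text -> 'No. 2' (the plain text inside the tags)
print(sel.css('article.post::attr(data-content)').get())  # ::attr() -> 'id-123' (an attribute on the tag itself)
print(len(sel.css('article.post')))                       # period selects by class -> 1 match here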

Here is some example code.

$ vi my_spider.py

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.crawler import CrawlerProcess
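# CrawlerProcess is what lets this run as a plain standalone script (no scrapy project / scrapy.cfg needed)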

import numpy as np
import pandas as pd

class PostSpider(CrawlSpider):
    name = 'extract_posts'

    def start_requests(self):
        self.base_url = 'https://somesite.com'

        threads_list = pd.read_csv('some_list_of_urls.csv')
        # csv has headers: | link |
        urls_list = threads_list.dropna().link

        for url in urls_list:
            url = self.base_url + url
            self.logger.info(f'Now working with url {url}')
            yield scrapy.Request(url=url, callback=self.parse_page)

    def parse_page(self, response):
        self.logger.info(f'Parsing url {response.url}')
        page_name_and_pagination = response.css('title::text').get() 

        # - get post metadata
        posts = response.css('article.post')
            # <article class='post' data-content='id-123'>
            # <header class='msg-author'> <div class='msg-author'>
            # <a class='blah' href='index.php?threads/somethreadtitle/post-123'>
            #   No. 2</a>
            # </div> </header> </article>

        for post in posts: 
            post_id = post.css('::attr(data-content)').get()
            post_ordinal = post.css('.msg-author a::text').get() 

            self.logger.info(
                f'Now scraped: {page_name_and_pagination}')

            page_data = {
                'post_id': post_id,
                'post_ordinal': post_ordinal}
            yield page_data

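# each dict yielded by parse_page() becomes one row in the FEEDS csv configured here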
c = CrawlerProcess(
    settings={
        "FEEDS":{
            "_tmp_posts.csv" : {"format" : "csv",
                                "overwrite":True,
                                "encoding": "utf8",
                            }},
        "CONCURRENT_REQUESTS":1, # default 16
        "CONCURRENT_REQUESTS_PER_DOMAIN":1, # default 8
        "CONCURRENT_ITEMS":1, # default 100
        "DOWNLOAD_DELAY": 1, # default 0; seconds to wait between requests
        "DEPTH_LIMIT":0, # how many pages down to go in pagination; 0 means no limit
        "JOBDIR":'crawls/post_crawler', # persist crawl state here, so a stopped crawl can be resumed by re-running the script
        "DUPEFILTER_DEBUG":True, # log every request the dupe filter skips (the filter itself is what prevents rescraping a page)
    }
)
c.crawl(PostSpider)
c.start()

$ python my_spider.py
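
On the log settings mentioned up top, a minimal sketch (the filename is just an example): scrapy's default log format already includes timestamps, so sending everything to a file only takes a couple of extra keys in the same settings dict as above:

        "LOG_FILE": "_tmp_posts.log", # write the log here instead of the terminal
        "LOG_LEVEL": "INFO", # default is DEBUG, which is very noisy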

end

that’s all folks