single-file scrapy spider (without all the default folder/project overhead)

just documentation for myself while still fresh in my head.

Here’s the simplest version. More details to follow (coloring terminal output, adding timestamps to a log file, explaining some of the delay and log settings, etc.)


  1. Create a CrawlSpider class
  2. Inside it, define a start_requests() method, where you take a list of urls (or a single url) and yield a scrapy.Request() whose callback is a function you define (e.g. self.parse_page())
    • Note the use of a base_url: links on a page are often relative links
  3. Create the parsing function named in the callback. Most commonly you use response.css() with the same syntax as normal css selectors: a period to select a class, ::text to select the plaintext that’s inside a set of tags, or ::attr() to select a specific attribute inside the tag itself
  4. Create a dictionary of the desired values from the page, and yield it
  5. Create a CrawlerProcess that outputs to CSV
    • c = CrawlerProcess(settings={
      "FEEDS":{ "output_posts_crawler.csv" : {"format" : "csv"}}})
  6. Call the CrawlerProcess on the CrawlSpider defined earlier, then start the spider
    • c.crawl(PostSpider)
    • c.start()

Here is some example code:

$ vi

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.crawler import CrawlerProcess

import numpy as np
import pandas as pd

class PostSpider(CrawlSpider):
    name = 'extract_posts'

    def start_requests(self):
        self.base_url = ''

        threads_list = pd.read_csv('some_list_of_urls.csv')
        # csv has headers: | link |
        urls_list = threads_list.dropna().link

        for url in urls_list:
            url = self.base_url + url
            self.logger.error(f'now working with url {url}')
            yield scrapy.Request(url=url, callback=self.parse_page)

    def parse_page(self, response):
        self.logger.info(f'Parsing url {response.url}')
        page_name_and_pagination = response.css('title::text').get() 

        # - get post metadata
        # selector inferred from the sample markup below; adjust for the real page
        posts = response.css('article.post')
            # <article class='post' data-content='id-123'>
            # <header class='msg-author'> <div class='msg-author'>
            # <a class='blah' href='index.php?threads/somethreadtitle/post-123'>
            #   No. 2</a>
            # </div> </header> </article>

        for post in posts: 
            post_id = post.css('::attr(data-content)').get()
            post_ordinal = post.css('.msg-author a::text').get() 

            self.logger.info(
                f'Now scraped: {page_name_and_pagination}')

            page_data = {
                'post_id': post_id,
                'post_ordinal': post_ordinal}
            yield page_data

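c = CrawlerProcess(
    settings={
        "FEEDS": {
            "_tmp_posts.csv" : {"format" : "csv",
                                "encoding": "utf8"}},
        "CONCURRENT_REQUESTS":1, # default 16
        "CONCURRENT_REQUESTS_PER_DOMAIN":1, # default 8
        "CONCURRENT_ITEMS":1, # default 100
        "DOWNLOAD_DELAY": 1, # seconds to wait between requests, default 0
        "DEPTH_LIMIT":0, # max depth of links to follow, 0 means no limit
        "DUPEFILTER_DEBUG":True, # log every skipped duplicate request (duplicates are filtered either way)
    })

# register the spider, then start crawling (blocks until the crawl finishes)
c.crawl(PostSpider)
c.start()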
$ python
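
One note on the self.base_url + url line in start_requests: plain string concatenation only works if every link in the csv is relative. A possible alternative is urljoin from the standard library, which leaves already-absolute links alone. The base url below is a placeholder I made up for the sketch, not the real site.

```python
from urllib.parse import urljoin

# Placeholder base url (hypothetical), standing in for self.base_url.
base_url = 'https://forum.example.com/'

links = [
    'index.php?threads/somethreadtitle/post-123',          # relative link
    'https://forum.example.com/index.php?threads/other/',  # already absolute
]

# urljoin resolves relative links against the base and keeps absolute ones as-is.
full_urls = [urljoin(base_url, link) for link in links]

for full_url in full_urls:
    print(full_url)
```

Inside a callback, response.urljoin(link) does the same job using the url of the page being parsed as the base.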


that’s all folks