single-file scrapy spider (without all the default folder/project overhead)

just documentation for myself while still fresh in my head.

Here’s the simplest version. More details to follow (coloring terminal output, adding timestamps to a log file, explaining some of the delay and log settings, etc.)

Overview

  1. Create a CrawlSpider class
  2. Inside it, define a start_requests() method that takes a list of urls (or a single url) and yields a scrapy.Request() whose callback is a function you define (e.g. self.parse_page())
    • Note use of a base_url, this is because often links on a page are relative links
  3. Create the parsing function called earlier. Most commonly you use response.css(), which works like normal css selectors plus a couple of extras: a period selects a class, ::text selects the plaintext that’s inside a set of tags, and ::attr() selects a specific attribute of the tag itself
  4. Create a dictionary of the desired values from the page, and yield it
  5. Create a CrawlerProcess that outputs to CSV
    • c = CrawlerProcess(settings={
      "FEEDS":{ "output_posts_crawler.csv" : {"format" : "csv"}}
      })
  6. Call the CrawlerProcess on the CrawlSpider defined earlier, then start the spider
    • c.crawl(PostSpider)
      c.start()
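
On the base_url note in step 2: plain string concatenation only works if every link starts with a slash. A minimal sketch of a sturdier way to build the full url, using urljoin from the standard library. The helper name absolute_url and the sample links are made up for illustration, and the site is the same placeholder as in the example code.

```python
from urllib.parse import urljoin

BASE_URL = "https://somesite.com"  # placeholder site, same as in the example code


def absolute_url(link: str, base_url: str = BASE_URL) -> str:
    """Resolve a possibly-relative link against base_url (hypothetical helper)."""
    return urljoin(base_url, link)


if __name__ == "__main__":
    sample_links = [
        "/threads/some-thread/",                       # relative, leading slash
        "index.php?threads/somethreadtitle/post-123",  # relative, no leading slash
        "https://othersite.com/page",                  # already absolute, left as-is
    ]
    for link in sample_links:
        print(absolute_url(link))
```

Inside a parse callback, Scrapy's response.urljoin() does the same job relative to the page that was just fetched.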

Here is some

example code

$ vi my_spider.py

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.crawler import CrawlerProcess

import pandas as pd

class PostSpider(CrawlSpider):
    name = 'extract_posts'

    def start_requests(self):
        self.base_url = 'https://somesite.com'

        threads_list = pd.read_csv('some_list_of_urls.csv')
        # csv has headers: | link |
        urls_list = threads_list.dropna().link

        for url in urls_list:
            url = self.base_url + url
            self.logger.info(f'now working with url {url}')
            yield scrapy.Request(url=url, callback=self.parse_page)

    def parse_page(self, response):
        self.logger.info(f'Parsing url {response.url}')
        page_name_and_pagination = response.css('title::text').get() 

        # - get post metadata
        posts = response.css('article.post')
            # <article class='post' data-content='id-123'>
            # <header class='msg-author'> <div class='msg-author'>
            # <a class='blah' href='index.php?threads/somethreadtitle/post-123'>
            #   No. 2</a>
            # </div> </header> </article>

        for post in posts: 
            post_id = post.css('::attr(data-content)').get()
            post_ordinal = post.css('.msg-author a::text').get() 

            self.logger.info(
                f'Now scraped: {page_name_and_pagination}')

            page_data = {
                'post_id': post_id,
                'post_ordinal': post_ordinal}
            yield page_data

c = CrawlerProcess(
    settings={
        "FEEDS":{
            "_tmp_posts.csv" : {"format" : "csv",
                                "overwrite":True,
                                "encoding": "utf8",
                            }},
        "CONCURRENT_REQUESTS":1, # default 16
        "CONCURRENT_REQUESTS_PER_DOMAIN":1, # default 8
        "CONCURRENT_ITEMS":1, # default 100
        "DOWNLOAD_DELAY": 1, # seconds between requests, default 0
        "DEPTH_LIMIT":0, # max link depth to follow; 0 (the default) means no limit
        "JOBDIR":'crawls/post_crawler', # persist crawl state here so a run can be paused and resumed
        "DUPEFILTER_DEBUG":True, # log every filtered duplicate request (duplicate filtering itself is on by default)
    }
)
c.crawl(PostSpider)
c.start()

$ python my_spider.py
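
The spider only uses pandas to pull one column out of a csv. If you'd rather not pull in pandas for that, here is a rough sketch with the standard-library csv module. It assumes the same single "link" header as above; the function name load_links and the sample rows are made up for illustration.

```python
import csv
import io


def load_links(csv_file, column="link"):
    """Return the non-empty values of `column` from an open csv file (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    # Skip rows where the column is missing or blank, similar to dropna()
    return [row[column].strip() for row in reader if (row.get(column) or "").strip()]


if __name__ == "__main__":
    # In-memory stand-in for some_list_of_urls.csv
    sample = io.StringIO("link\n/threads/a/\n\n/threads/b/\n")
    print(load_links(sample))
```

With a real file it would be something like: with open('some_list_of_urls.csv', newline='') as f: urls_list = load_links(f)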

end

that’s all folks

yea, you’re pretty sleepy, and we don’t know why

aka idiopathic hypersomnia

Idiopathic hypersomnia (IH) is a neurological sleep disorder that can affect many aspects of a person’s life. Symptoms often begin between adolescence and young adulthood and develop over weeks to months. People with IH have a hard time staying awake and alert during the day (chronic excessive daytime sleepiness). They may fall asleep unintentionally or at inappropriate times, interfering with daily functioning. They may also have difficulty waking up from nighttime sleep or daytime naps. Sleeping longer at night does not appear to improve daytime sleepiness. The cause of IH is not known. Some people with IH have other family members with a sleep disorder such as IH or narcolepsy.

sometimes i think it’s made up, but then other times i remember how, even if it’s made up or not that severe, my sleepiness does impact many parts of my life. and life is short, why not happier if i can?

  • going to cafe by myself: good for waking up, but it’s awkward if I fall asleep there, so can’t go
  • going to student center (or library): good to get out of house (not just sleep for half a day or more), but at e.g. the Harvard student center, they’ll actually have security come around and wake you up if you fall asleep. needless to say, it’s very unpleasant to come out of a dead sleep to a security person waking you up
    • i have gotten into a surprising amount of situations due to napping in public places. generally not an issue in some of the out-of-the-way parks in cambridge/somerville. generally is an issue in downtown boston. also depends on if you’re sitting up with a laptop in your nap, vs. fully laid out head on a bookbag jacket pulled up over your head
    • one time i decided that napping on the stairs along mass ave was needed, woke up to concerned strangers
  • a labmate once commented that they think it’s really rude if someone goes to a lecture and falls asleep, and you should not go in that case. i didn’t say that if i needed to guarantee i wouldn’t fall asleep, i’d never go to any lectures. but it’s still important to go, to stay in the loop…
  • one time i fell asleep while sitting across from a professor waiting a few minutes for my turn to speak…
  • i’ve definitely gotten sleepy or overloaded to the point of all-consuming desire to find the nearest spot to fall asleep on, e.g. bench on the side of the street. like when you’re really hungry and it’s hard to think of anything else
  • i’ve really avoided 9-5 jobs, i just remember staring uselessly at my screen scrolling mindlessly for an hour because i couldn’t fully sleep and couldn’t fully not sleep
  • which is why i tend to want to work close to home

with my other medical issues stabilized i understood that my sleepiness definitely wasn’t part of e.g. depression, but just inherent to me. so eventually i went to get a sleep study, and got a diagnosis of idiopathic hypersomnia.

the sleep dr. treated this as a 5-minute appointment situation, and didn’t particularly care how i was treated / left it up to me, so i left feeling discouraged and that maybe i was making it all up anyway. so it went for several years until finally, i made an international trip that was supposed to be a “workcation”. instead i just slept 13 hrs a day long after i should have been un-jet lagged, despite being in a foreign country by myself which should have been exciting (admittedly i had a severely sprained ankle, hadn’t gotten traveler’s insurance so was too scared of medical fees to go to a clinic, and had never sprained anything so didn’t know to get a brace. i just got compression socks and a cane… and hopped around). combined with the falling asleep sitting up at a table, i really decided to handle this once and for all

anyway, medication has helped, but then (due to recent supply chain issues) i briefly came off of medication, and wow. how did i used to live like this? lol

it’s not so much the 1-3 hr long naps (despite sleeping 8-9hrs), but the randomness of them. and constantly devoting brainpower / feeling like i’m desperately running in circles — can i get more exercise? go out to a cafe? eat less carbs? scheduling meetings gets really stressful — what if i don’t wake up? should i try afternoon meetings? first thing in the morning?

the main thing of meds is I can reliably be awake from 9-12, and can thus have a semi-regular schedule

also, it helps having a really supportive doctor who is willing to listen to me, patiently explain potential side effects, what to watch out for, how to test dosage, and their experience with other patients, and take my concerns seriously. so that i feel like there is something legitimate to my issues. vs. the original person just writing a script and saying “schedule a follow-up in four months, feel free to text.” and at the same time, i read things online and know that my issues are not nearly as severe as others.

anyway, some links:

https://www.dynamed.com/condition/classification-of-sleep-disorders#IDIOPATHIC_HYPERSOMNIA

https://rarediseases.info.nih.gov/diseases/8737/idiopathic-hypersomnia

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5558858/

https://emedicine.medscape.com/article/291699-overview

https://www.sleepfoundation.org/excessive-sleepiness

it’s complicated diagnostically

All these disorders have in common a subjective complaint of excessive sleepiness. ICSD-3 defines this as “daily episodes of an irrepressible need to sleep or daytime lapses into sleep.” For those disorders, such as narcolepsy and idiopathic hypersomnia (IH), which require demonstration of objective sleepiness by the multiple sleep latency test (MSLT), a mean sleep latency of < 8 min on the MSLT is required. This criterion is unchanged from the ICSD-2 and represents the best compromise between sensitivity and specificity.
However, physicians must recognize that there is substantial overlap between pathologically sleepy individuals and “normal” (often sleep-deprived) persons. Therefore, in establishing a diagnosis of a central disorder of hypersomnolence, physicians must be keenly aware that sleep deprivation, especially in those with longer sleep requirements, may account for abnormal MSLT results. The use of sleep logs and actigraphy for at least 1 week prior to MSLT is strongly encouraged to rule out insufficient sleep, sleep-wake schedule disturbances, or both as potential explanations for abnormal MSLT findings. Limited data suggest that one-off subjective reports and sleep logs alone may significantly overestimate total sleep time in the days prior to MSLT. Conversely, some patients with legitimate central hypersomnolence conditions may not consistently demonstrate mean MSLT latencies of < 8 min. Clinical judgment is required in such cases. Repeat MSLT at a later date may confirm objective sleepiness.

even MSLT is suspect though

First, neither short MSL nor SOREMPs are specific. Up to 30% of the normal population may have a MSL ≤ 8 min, the current cutoff for the hypersomnolence disorders.

Second, the MSLT may not be adequately sensitive, especially for IH. The 8-min cutoff was determined for patients with narcolepsy and extended to IH for “simplicity,” without independent determination

Third, while MSLT test-retest reliability is high in patients with narcolepsy with cataplexy restudied within 3 weeks, in clinical practice, test-retest reliability of the MSLT in narcolepsy without cataplexy and IH is poor. More than one-half of subjects with these disorders are given a changed diagnosis on repeat testing

some more interesting thoughts: ( i will need to cite all these later )

clinical presentation includes irresistible attacks of daytime sleepiness, unwanted, unrefreshing daytime naps ≥ 1 hour long, difficulty waking up from naps, and sleep inertia (sleep drunkenness)

Long sleepers feel fully refreshed and do not experience
daytime sleepiness if they are allowed to sleep as long as they need, in contrast with
patients with IH who continue to feel sleepy regardless of prior sleep duration.
In contrast to narcolepsy, patients with idiopathic hypersomnia generally have high sleep efficiency, sleep drunkenness, and long, unrefreshing naps.

Sleepiness is typically experienced as the inability to stay awake when desired, yet the MSLT measures “sleepability,” or the ability to fall asleep on command. These two constructs, while related, are clearly not identical.

my conclusions / specifics

Total recording time was 528 minutes, from 23:53-8:32. Total sleep time was 385 minutes, with a sleep efficiency of 74%. Sleep and REM latencies were 17 and 116 minutes respectively.
There was 19% stage 1, 54% stage 2, 8% slow wave sleep and 18% REM, with 2 REM cycles.

There was moderate to severe hypersomnolence, with a mean sleep latency of 6 minutes, and no sleep onset REM in any of 4 naps. The sleep latencies were 8, 12.5, 1.5, 3 minutes respectively. The nap times were 10:05, 12:28, 2:23, 4:43
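
For my own sanity, a quick check of the nap numbers against the cutoff from the ICSD-3 passage quoted earlier. This is just arithmetic on the four latencies listed in the report, not a diagnostic tool, and the variable names are mine.

```python
from statistics import mean

# Nap sleep latencies in minutes, as listed in the report above
nap_latencies_min = [8, 12.5, 1.5, 3]

# Cutoff from the ICSD-3 passage quoted earlier: mean sleep latency < 8 min
MSLT_CUTOFF_MIN = 8

mean_latency = mean(nap_latencies_min)
under_cutoff = mean_latency < MSLT_CUTOFF_MIN

print(f"mean sleep latency: {mean_latency} min")
print(f"under the {MSLT_CUTOFF_MIN}-min cutoff: {under_cutoff}")
```

The mean comes out to 6.25 minutes, which the report rounds to 6.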

moderate to severe hypersomnolence with no objective sleep onset REM. Delayed sleep phase. The sleep latencies do not just reflect tendency to delayed sleep phase. Consistent with diagnosis of idiopathic hypersomnolence.

complicated in diagnosis, but not in treatment, so in some sense it doesn’t matter in the end to me as a patient…