All posts by nouyang

Diary #89 – lifespan clock

yea ok it’s good to split these into diary / not diary since only half my blog posts are about technical things anymore

i’m here to chronicle some issues i never thought about. they suddenly coalesced over the last few months

i’ve always been consciously against the idea of a biological clock, because it causes semi-sexist complications (in terms of employer discrimination, career ambitions, etc.). kind of allergic to the concept, actually

i mean wow way to give me FOMO anxiety BBC, why didn’t I hear about this when I was 20

https://www.bbc.com/worklife/article/20220603-why-women-have-to-sprint-into-leadership-positions

There is immense pressure for women to reach a certain level of career and financial success before becoming parents, says Karin Kimbrough, Chief Economist at LinkedIn, who conducted the research into the 10-year window to leadership.

Kimbrough calls this process a “sprint” to leadership, meaning that women who don’t scale the leadership ladder very quickly are less likely to make it to the top at all. This might mean they end up overworking or making enormous personal sacrifices in order to ascend to C-suite level during this crucial decade. Much of this urgency to sprint – and the exhausting overwork it involves – stems from women needing to make sure their careers don’t sink once they begin families.

They are racing the clock against the so-called motherhood penalty. In this phenomenon, women find their careers stalling in areas such as promotion and pay once their children are born (while, conversely, men’s careers accelerate after becoming fathers). This effect, as well as the enormous burden of caregiving responsibilities that women take on, is well documented (and similarly affects other types of caregivers, like looking after ageing parents, says Kimbrough).

Katie Bishop, for BBC

but through conversations (maybe 2 or 3 a year) i’ve started to realize things

  • parents don’t stay spry forever, and it’s nice if they’re lively enough to help care for kids (b/c they’re a lot of work, and that reduces the career impact)

this is the result of talking to a friend who consciously wanted her kid to know their grandparents as full, independent adults, and to be able to get to know them as an adult too

+ reflecting that yea, my relationship with my parents is vastly different at 30 than at 20. my grandparents all died relatively young, and I wish I’d gotten to know them a bit more when they were healthy.

+ also staying with an older person, who I think even 5 years ago was really spry and independent, but is now almost housebound

  • my other recent realization, it can take a while to get pregnant even. apparently on average it can take 6 months

I mean like this is so terrible from a project management / timeline perspective lol.

(Of course there is fostering, adopting, surrogacy, etc. Though I know relatively little about these. Other interesting rabbit holes — single dads by choice — see appendix.)

  • then i looked up more details… you’re supposed to wait (according to science) 1.5 to 2 years after giving birth before starting to try to have another kid

all this starts to add up!

  • also, if i want to wait to run for senate until after my kids are off to college, i don’t want to be another geriatric geezer in congress 0:

anyhow, basically it was the combination of spacing (interpregnancy interval) + delay (time to actually get pregnant) which was like … that’s an extra 2-3 years on top of my timeline estimate 0: 0: 0:

anyhow yea not very technical but just stuff i never thought about: there’s a lifespan clock separate from the fertility clock (which! btw! exists for guys too!)

i wonder if i’ll look back in 30 years and laugh at the “geriatric geezer” comment haha

appendix – single fathers by choice

Recently i learned there are also single fathers by choice, but only very few studies on them. Many studies are on gay dads having surrogate children (often in Israel — apparently the government / religion / culture really emphasizes having kids o__o — but it’s complicated because (according to the quotes in the papers) the conservative government also doesn’t really approve of LGBT. So the dads almost viewed it as an empowering act to have surrogate families). But far fewer on *single* dads by choice.

  • https://www.semanticscholar.org/paper/The-Social-Experiences-of-Single-Gay-Fathers-in-An-Tsfati-Segal%E2%80%90Engelchin/8405f0b875742b805de49aa91292638385c4293f
  • https://www.semanticscholar.org/paper/Children-of-Single-Fathers-Created-by-Surrogacy%3A-Pereira/ee815c5b824b37e96bc2f42088a66aa709e6dff9
  • https://www.theguardian.com/society/2020/jan/29/i-always-wanted-to-be-a-dad-the-rise-of-single-fathers-by-choice

single-file scrapy spider (without all the default folder/project overhead)

just documentation for myself while still fresh in my head.

Here’s the simplest version. More details to follow (coloring terminal output, adding timestamps to a log file, explaining some of the delay and log settings, etc.)

Overview

  1. Create a CrawlSpider class
  2. Inside, define a start_requests() method that takes a list of urls (or a single url) and yields a scrapy.Request() whose callback is a function you define (e.g. self.parse_page())
    • Note use of a base_url, this is because often links on a page are relative links
  3. Create the parsing function called earlier. Most commonly you use response.css(), which works like normal css selectors (a period selects a class), plus two scrapy extensions: ::text selects the plaintext inside a set of tags, and ::attr() selects a specific attribute on the tag itself
  4. Create a dictionary of the desired values from the page, and yield it
  5. Create a CrawlerProcess that outputs to CSV
    • c = CrawlerProcess(settings={
      "FEEDS":{ "output_posts_crawler.csv" : {"format" : "csv"}}
      })
  6. Call the CrawlerProcess on the CrawlSpider defined earlier, then start the spider
    • c.crawl(PostSpider)
      c.start()
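A small sketch of the base_url note from step 2, using only the standard library. The helper name absolutize() and the example links are made up for illustration. Plain string concatenation works when every link is relative, but urljoin also copes with links that are already absolute (inside a scrapy callback, response.urljoin() does the same job relative to the current page).

```python
from urllib.parse import urljoin

BASE_URL = 'https://somesite.com/'  # placeholder site, same as in the overview


def absolutize(base_url, link):
    """Turn a possibly-relative link into an absolute url (hypothetical helper)."""
    return urljoin(base_url, link)


# relative link with a leading slash
a = absolutize(BASE_URL, '/threads/abc/')
# relative link without a leading slash
b = absolutize(BASE_URL, 'index.php?threads/somethreadtitle/post-123')
# already-absolute link is left alone
c = absolutize(BASE_URL, 'https://other.example/page')

for url in (a, b, c):
    print(url)
```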

Here is some example code:

$ vi my_spider.py

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.crawler import CrawlerProcess

import pandas as pd

class PostSpider(CrawlSpider):
    name = 'extract_posts'

    def start_requests(self):
        self.base_url = 'https://somesite.com'

        threads_list = pd.read_csv('some_list_of_urls.csv')
        # csv has headers: | link |
        urls_list = threads_list.dropna().link

        for url in urls_list:
            url = self.base_url + url
            self.logger.info(f'now working with url {url}')
            yield scrapy.Request(url=url, callback=self.parse_page)

    def parse_page(self, response):
        self.logger.info(f'Parsing url {response.url}')
        page_name_and_pagination = response.css('title::text').get() 

        # - get post metadata
        posts = response.css('article.post')
            # <article class='post' data-content='id-123'>
            # <header class='msg-author'> <div class='msg-author'>
            # <a class='blah' href='index.php?threads/somethreadtitle/post-123'>
            #   No. 2</a>
            # </div> </header> </article>

        for post in posts: 
            post_id = post.css('::attr(data-content)').get()
            post_ordinal = post.css('.msg-author a::text').get() 

            self.logger.info(
                f'Now scraped: {page_name_and_pagination}')

            page_data = {
                'post_id': post_id,
                'post_ordinal': post_ordinal}
            yield page_data

c = CrawlerProcess(
    settings={
        "FEEDS":{
            "_tmp_posts.csv" : {"format" : "csv",
                                "overwrite":True,
                                "encoding": "utf8",
                            }},
        "CONCURRENT_REQUESTS":1, # default 16
        "CONCURRENT_REQUESTS_PER_DOMAIN":1, # default 8 
        "CONCURRENT_ITEMS":1, # default 100
        "DOWNLOAD_DELAY": 1, # default 0
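        "DEPTH_LIMIT":0, # max link depth to follow; 0 means no limit
        "JOBDIR":'crawls/post_crawler', # saves crawl state (incl. seen requests) so a run can be paused / resumed
        "DUPEFILTER_DEBUG":True, # log every skipped duplicate request (duplicates are skipped by default)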
    }
)
c.crawl(PostSpider)
c.start()

$ python my_spider.py

end

that’s all folks