
Plucker 🪶

Junior Guru scrapers

How does it work

This repository contains the source code of all scrapers Junior Guru needs to function. The scrapers are implemented using the Scrapy framework, and although there are some customizations, most Scrapy conventions should work out of the box. Contributing a new scraper shouldn't be hard if you have some knowledge of how Scrapy works.

The scrapers are then deployed to the Apify platform as so-called actors. The code here works as a monorepo for Apify actors and diverges quite significantly from the Scrapy template Apify provides. Deploying a new scraper to Apify is a manual process, documented below.

Code in this repository is executed by Apify, on their infrastructure. The main Junior Guru codebase then gets the scraped data in the form of datasets available through the Apify API.
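To illustrate the consuming side, downloading such a dataset with the apify-client Python library could look like this (a minimal sketch; the token and the dataset ID are placeholders):

from apify_client import ApifyClient

# both values are placeholders, not anything real
client = ApifyClient("APIFY_API_TOKEN")
items = client.dataset("DATASET_ID").list_items().items
print(items)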

Running scrapers

Use Scrapy's crawl command or its shell. Plucker has a crawl CLI command, which you can also use, but it's more useful for integrating with Apify than for actual scraper development.
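For example, development runs of the existing job-links scraper could look like this (the URL passed to the shell is just an illustration):

$ scrapy crawl job-links
$ scrapy shell "https://junior.guru"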

After each scraper run you can check the contents of the items.json file to see if your scraper works correctly.

Passing parameters

Sometimes scrapers need input data. Plucker's crawl CLI command can pass parameters to your scraper through the --params option. Use a shell pipe to pass ad-hoc data:

$ echo '{"links": [{"url": "https://junior.guru"}]}' | plucker crawl job-links --params

Use a file for more complex input:

$ plucker crawl job-links --params < params.json

You can also run plain plucker crawl job-links --params and type the JSON manually on the command's standard input.

At Apify, whatever is set as the actor input gets passed down as params (except for the proxy settings). Don't forget to add an input schema to your actor.json.
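A minimal sketch of what an input schema for the job-links example above could look like (the titles and the description are illustrative; see Apify's input schema documentation for the exact format and for how to reference the schema from actor.json):

{
    "title": "Job links",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "links": {
            "title": "Links",
            "type": "array",
            "description": "Links to scrape, e.g. [{\"url\": \"https://junior.guru\"}]",
            "editor": "json"
        }
    },
    "required": ["links"]
}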

You can access the params inside the spider class as self.settings.get("SPIDER_PARAMS").
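For example (a minimal sketch; the class name and the links structure are illustrative, matching the job-links example above):

import scrapy

class Spider(scrapy.Spider):
    name = "job-links"

    def start_requests(self):
        # locally the params come from --params, at Apify from the actor input
        params = self.settings.get("SPIDER_PARAMS") or {}
        for link in params.get("links", []):
            yield scrapy.Request(link["url"], callback=self.parse)

    def parse(self, response):
        ...  # process the page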

How to add new scraper

Look at the existing code and follow its conventions. To create a new scraper, e.g. gravel-bikes:

  1. Should the new scraper produce items not yet known to this codebase, such as bikes, go to jg/plucker/items.py and add a new Scrapy Item class, e.g. GravelBike. Run plucker schemas to generate a schema for Apify. Should the new scraper produce items already known to this codebase, such as jobs, you can skip this step.
  2. Run plucker new and answer the questions. It is a cookiecutter which takes the scraper_template directory and creates the scaffolding of a new scraper for you.
  3. Fill the newly created jg/plucker/gravel_bikes/spider.py file with the implementation of your scraper (see the sketch after this list). See the Scrapy documentation: Tutorial, Spiders. You can also learn scraping from Apify's Web scraping basics course.
  4. Make sure the spider produces instances of the selected Item subclass, e.g. GravelBike.
  5. Run the spider with scrapy crawl gravel-bikes. Learn about Scrapy's crawl command or its shell. Develop and debug.
  6. Test the spider, i.e. create a tests/gravel_bikes directory with test_spider.py inside, optionally with some test fixtures (static HTML files etc.) alongside.
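A minimal sketch of how steps 1-4 could fit together (the GravelBike fields, the start URL, and the CSS selectors are made up for illustration):

# jg/plucker/items.py
import scrapy

class GravelBike(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

# jg/plucker/gravel_bikes/spider.py
import scrapy

from jg.plucker.items import GravelBike

class Spider(scrapy.Spider):
    name = "gravel-bikes"
    start_urls = ["https://example.com/gravel-bikes"]

    def parse(self, response):
        # yield one item per bike listed on the page
        for bike in response.css(".bike"):
            yield GravelBike(
                name=bike.css("h2::text").get(),
                url=response.urljoin(bike.css("a::attr(href)").get()),
            )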

Deploying to Apify (you need to be an admin)

  1. Push all your code to GitHub.
  2. Run plucker deploy gravel-bikes.
  3. Go to Apify Console and verify everything went well.
  4. Go to the Builds tab and start a build.
  5. Go to the Runs tab and try a first run.
  6. Go to the Schedules page and assign your new actor to an existing schedule or create a new one.

Automatic builds

There is a nightly GitHub Action which re-builds all actors from the current code in the main branch. This is because Apify's built-in automatic builds didn't work properly, but also because rebuilding on every commit would be a waste of resources.

Monitoring

There is a nightly GitHub Action which checks whether each actor's last run finished successfully. If it didn't, the GitHub Action fails, which triggers an e-mail notification. Apify used to send a summary e-mail about actor runs, but they removed that feature and there was no equivalent at the time. Maybe there is now, but the monitoring is already implemented, so…

Notes on development

  • Use Poetry for dependency management. Run poetry install.
  • It is preferred to pin exact versions of dependencies, without ^, and let GitHub's Dependabot upgrade them in Pull Requests. Unfortunately, there is no setting in pyproject.toml which would enforce this, so whenever new dependencies are added, one needs to manually remove the ^ characters.
  • Run pytest to see if your code has any issues.
  • Run ruff check --fix and ruff format to fix your code.
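Putting the above together, a typical check before committing could look like this (prefix the commands with poetry run if the virtualenv isn't activated):

$ poetry install
$ pytest
$ ruff check --fix
$ ruff format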

Dictionary

  • scraper - Generic name for a program which downloads data from web pages (or APIs). This repository uses the word scraper to refer to a spider & actor combo.
  • spider - This is what the Scrapy framework calls the implementation of a scraper.
  • actor - This is what the Apify platform calls the implementation of a scraper.
  • plucker - Repository of Junior Guru scrapers. In English, a plucker is one who or that which plucks. Naming in Junior Guru is usually poultry-themed, and Honza felt that plucking is a nice analogy to web scraping.

License

AGPL-3.0-only, copyright (c) 2024 Jan Javorek and contributors.