Run Scrapy code from Jupyter Notebook without issues

Tamjid Ahsan
Towards Data Science
3 min read · Jul 26, 2021


Scrapy is an open-source framework for extracting data from websites. It is fast, simple, and extensible. Every data scientist should be familiar with it, as they often need to gather data this way. Data scientists usually prefer some sort of computational notebook for managing their workflow. Jupyter Notebook is very popular among data scientists, alongside other options like PyCharm, Zeppelin, VS Code, nteract, Google Colab, and Spyder, to name a few.

Scraping with Scrapy is often done from a .py file. It can also be launched from a notebook, but the problem is that it throws a `ReactorNotRestartable` error when the code block is run a second time.

Photo by Clément Hélardot on Unsplash

There is a workaround for this error using the crochet package. In this blog post, I show the steps I took to run Scrapy code from Jupyter Notebook without the error.

Prerequisites:

scrapy: pip install scrapy

crochet: pip install crochet

Any notebook environment for Python; I am using Jupyter Notebook: pip install notebook

Demo Project:

For demoing the steps, I am scraping Wikiquote for quotes by Maynard James Keenan, an American rock singer, and saving them to a .csv file that is overwritten every time the script runs, which is handy for a fresh start of the project. This is achieved through the spider's custom settings, by passing a nested dictionary with FEEDS as the key and, inside it, the name of the output file mapped to the settings for that feed.
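
Here is a minimal sketch of what the QuotesToCsv spider could look like; the spider name, start URL, and CSS selector are my assumptions and may need adjusting to Wikiquote's actual markup.

import scrapy

class QuotesToCsv(scrapy.Spider):
    name = "mjkquotes"  # assumed spider name
    start_urls = ["https://en.wikiquote.org/wiki/Maynard_James_Keenan"]
    # FEEDS maps the output file name to its feed settings;
    # overwrite=True recreates quotes.csv on every run
    custom_settings = {
        "FEEDS": {
            "quotes.csv": {
                "format": "csv",
                "overwrite": True,
            }
        }
    }

    def parse(self, response):
        # the selector below is a guess at the page structure; adjust as needed
        for quote in response.css("div.mw-parser-output > ul > li::text").getall():
            yield {"quote": quote.strip()}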

To initialize the process, I run the following code:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(QuotesToCsv)
process.start()  # blocks here until the crawl finishes

It runs without issue the first time and saves the csv file at the root, but throws the following error from the second run onwards.

`ReactorNotRestartable` error, image by Author.

To run the code again without issue, the kernel must be restarted. With crochet, however, the same code can run in a Jupyter Notebook repeatedly without that error.
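
Here is a minimal sketch of the crochet-based version, assuming the QuotesToCsv spider defined above; the function name run_spider matches the call that follows, and the 10-second timeout matches the decorator discussed later.

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner

setup()  # hook crochet into Twisted's reactor; call once per process

@wait_for(10)  # block the caller until the crawl finishes or 10 seconds pass
def run_spider():
    crawler = CrawlerRunner()
    # crawl() returns a Deferred; @wait_for waits on it in the reactor thread
    return crawler.crawl(QuotesToCsv)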

Now I call this function to run the spider without issue.

run_spider()

Now let me go through the differences between the two approaches:

  1. Using CrawlerRunner instead of CrawlerProcess.
  2. Importing setup and wait_for from crochet and calling setup() once.
  3. Using the @wait_for(10) decorator on the function that runs the spider. @wait_for handles blocking calls into the Twisted reactor thread; see the crochet documentation to learn more about this.

Voilà! No more error. The script runs and saves the output as quotes.csv.

This can also be done from a .py file run from the Jupyter Notebook with !python scrape_webpage.py, if the file contains the script. That being said, it is convenient to develop code from a notebook. Also, one caveat of this approach is that CrawlerRunner produces far less logging output than CrawlerProcess.
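
For reference, here is a minimal sketch of what such a scrape_webpage.py could contain; the module path used for the spider import is hypothetical.

# scrape_webpage.py
from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesToCsv  # hypothetical module holding the spider

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToCsv)
    process.start()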

Photo by Roman Synkevych on Unsplash

Here is the GitHub repo with all the code and notebooks that I used to test out this workflow.

Until next time!!!
