Using BeautifulSoup, Scrapy, and Selenium for relatively small projects.
We need data to work on a data science project. Luckily, the Internet is full of data. We can obtain it by downloading readily available datasets from a source or by calling an API. Sometimes, though, the data is behind a paywall or is not up to date, and the only way to get it is from the website itself. For a simple task, copying and pasting does the trick, but for large amounts of data spread across several pages that is impractical. In those scenarios, web scraping can help you extract the data you want. This can be done in several ways, and there are several Python packages that serve that specific need.
Of those packages, BeautifulSoup and Scrapy are the most popular. Scrapy is a framework created specifically for downloading, cleaning, and saving data from the web, and it will help you end to end. BeautifulSoup is a smaller package that only helps you get information out of web pages; for other tasks, such as downloading the pages, it depends on other packages. Another popular option is Selenium. Selenium is not purpose-built for scraping; it is a tool for automated website testing. As a byproduct, we can also use it for web scraping.
In this article, I analyze the three most popular web scraping tools in Python by scraping a target website, so you can choose the one that best suits your project. As a side note, always be respectful and mindful of ethics when scraping, and follow the instructions in robots.txt at the root of the target website.
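As a rough illustration of the robots.txt check, here is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content, the "my-scraper" user agent, and the /private/ path are all made up so that the example runs offline; they are not taken from the target website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# For a live site, call parser.set_url("<site>/robots.txt") and parser.read() instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is an assumed user-agent name; /private/ is an assumed path.
allowed_catalogue = parser.can_fetch(
    "my-scraper", "http://books.toscrape.com/catalogue/page-1.html"
)
allowed_private = parser.can_fetch(
    "my-scraper", "http://books.toscrape.com/private/data.html"
)
delay = parser.crawl_delay("my-scraper")  # seconds to wait between requests, if declared

print(allowed_catalogue, allowed_private, delay)
```

A scraper could call can_fetch before each request and sleep for the declared crawl delay between requests.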
Now let me explain my process a bit. I am not using multiprocessing with BeautifulSoup, which would make it possible to send multiple requests at once and speed up the process. I am also not deploying spiders for Scrapy, although deploying spiders is the preferred way to use it. The purpose of this article is to get performance metrics for a relatively small project. In this test, both BeautifulSoup and Scrapy rely on the requests library to obtain the HTML and then process the response.
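To make the "obtain the HTML, then process the response" step concrete, here is a minimal sketch of the parsing half with BeautifulSoup. The HTML snippet, the product_pod and price_color class names, and the extracted fields are assumptions for illustration; in a real run the string would come from something like requests.get(url).text, and the fields would be whatever the project needs.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the HTML that requests.get(url).text would return.
SAMPLE_HTML = """
<html><body>
  <article class="product_pod">
    <h3><a href="book-1.html" title="First Book">First Book</a></h3>
    <p class="price_color">£10.00</p>
  </article>
  <article class="product_pod">
    <h3><a href="book-2.html" title="Second Book">Second Book</a></h3>
    <p class="price_color">£12.50</p>
  </article>
</body></html>
"""


def parse_books(html):
    """Return a list of {'title', 'price'} dicts, one per product block."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra dependency
    books = []
    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.h3.a["title"],
            "price": article.select_one("p.price_color").get_text(strip=True),
        })
    return books


books = parse_books(SAMPLE_HTML)
print(books)
```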
I am using a Jupyter notebook for this analysis. My target website is http://books.toscrape.com/. I scrape the first 20 pages 31 times with each of BeautifulSoup, Selenium, and Scrapy, and then look at the processing time. All processing time measurements are in seconds. BeautifulSoup appears under the name bs4.
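The measurement loop can be sketched roughly as follows. This is not the notebook's actual code: time_runs and placeholder_scrape are hypothetical names, the page URL pattern is an assumption, and the placeholder does no network access so that the sketch runs anywhere.

```python
import time


def time_runs(task, repeats=31):
    """Call task() `repeats` times and return each run's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def placeholder_scrape():
    # Stand-in for "scrape the first 20 pages"; replace with a real scraping function.
    for page in range(1, 21):
        _ = f"http://books.toscrape.com/catalogue/page-{page}.html"  # assumed URL pattern


timings = time_runs(placeholder_scrape, repeats=31)
print(len(timings), min(timings), max(timings))
```

Running the same helper once per tool gives one list of 31 durations for each of them.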
Here are the mean, standard deviation, min, max, and quartiles. Within each row, the color code goes from green to deep purple for high to low values.
Looking at the mean, the table shows that bs4 and Scrapy have similar performance and beat Selenium by a huge margin. Scrapy is faster in terms of raw numbers, but the difference between Scrapy and bs4 is small in this setup.
Scrapy can be even faster if used to its full potential, i.e. by setting up a spider and running it without the dependencies that I used. Then again, for a smaller project where recurrent standalone use is not required, that is overkill IMHO. Scrapy is very powerful and is therefore more complex to set up properly.
For visualization, here is the distribution of those data points, bin size 15. This reaffirms the findings.
In addition, here is the normalized density of those data points, bin size 15.
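For readers who want to reproduce this kind of plot data, here is a minimal sketch with NumPy. It assumes that "bin size 15" means 15 bins, and it uses synthetic timings generated on the spot, not the measurements from this article.

```python
import numpy as np

# Synthetic placeholder timings (seconds), NOT the article's measurements.
rng = np.random.default_rng(0)
timings = rng.normal(loc=10.0, scale=0.5, size=31)

# Assumption: "bin size 15" is read as 15 bins.
counts, edges = np.histogram(timings, bins=15)             # raw counts per bin
density, _ = np.histogram(timings, bins=15, density=True)  # normalized density

# A normalized density integrates to 1 over the bin widths.
widths = np.diff(edges)
area = float(np.sum(density * widths))

print(counts.sum(), round(area, 6))
```

The same bins and density arguments are accepted by matplotlib's plt.hist if you want to draw the histogram instead of computing it.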
Let's summarize the comparison:
BeautifulSoup: pros are that it is fast, user friendly, and efficient.
Use it for quick scraping.
Scrapy: pros are that it is fast, efficient, and not resource hungry, with powerful post-processing, automation, pipelines and middleware, the ability to resume, and the ability to send multiple requests at once; it also has a robust community and is well documented.
Cons are that it is not beginner friendly and its learning curve is much steeper.
Use it when a reusable scraping script is needed; it is best for complex scraping.
Selenium: cons are that it is not built for scraping and is not efficient for it.
Use it when it is the only option for extracting data from a complex website.
So it boils down to your use case; there is no one ultimate solution. Keeping all of them in your data science toolkit is handy for when the situation demands. They all have their strengths and weaknesses.
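On a side note, pandas can be used to scrape a website if it contains a table.
import pandas as pd
# read_html returns a list of DataFrames, one per <table> on the page (URL is a placeholder)
tables = pd.read_html('http://example.com/')
df = tables[0]  # first table found on the page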
Now that that is out of the way, let's look at the code that I used to get the results. The files can be found on GitHub.
Now that I have all of those as lists, let's convert them to a pandas DataFrame and do some analysis.
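A minimal sketch of that conversion step might look like this. The list names and the numbers in them are placeholders chosen only to make the example runnable; they are not the timings measured for this article.

```python
import pandas as pd

# Placeholder timings in seconds, NOT the article's measurements.
bs4_times = [10.2, 10.5, 9.9, 10.1]
scrapy_times = [9.8, 10.0, 9.7, 9.9]
selenium_times = [25.3, 26.1, 24.8, 25.6]

# One column per tool, one row per run.
df = pd.DataFrame({
    "bs4": bs4_times,
    "scrapy": scrapy_times,
    "selenium": selenium_times,
})

# describe() reports count, mean, std, min, quartiles, and max for each column.
summary = df.describe()
print(summary)
```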
That is how I ended up with all the findings and figures shown above.
That is all for today. Until next time!