SUMMARY: The purpose of this project is to practice web scraping by gathering specific pieces of information from a website. The scraping code was written in Python 3 and uses the Scrapy framework maintained by Scrapinghub.
INTRODUCTION: A demo website, created by Scrapinghub, lists quotes from famous people. It has many endpoints that present the quotes in different ways, and each endpoint poses a different scraping challenge. For this Take1 iteration, the Python script follows the pagination links and scrapes the quote information from each page.
Starting URLs: http://quotes.toscrape.com/
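The "next" link on each page is relative (e.g. /page/2/), so it must be resolved against the current page URL before it can be requested; in the spider below this is what response.urljoin does. A minimal stdlib sketch of the same resolution (the path /page/2/ is taken from the demo site's pagination):

```python
from urllib.parse import urljoin

# Base URL of the demo site and a relative "next" href as found in li.next > a
base = 'http://quotes.toscrape.com/'
next_href = '/page/2/'

# Scrapy's response.urljoin(next_href) performs the same resolution internally
next_url = urljoin(base, next_href)
print(next_url)  # http://quotes.toscrape.com/page/2/
```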
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        self.log('I just visited: ' + response.url)
        # extract every quote block on the current page
        for quote in response.css('div.quote'):
            item = {
                'author_name': quote.css('small.author::text').extract_first(),
                'text': quote.css('span.text::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
                # selector is relative to the quote block; repeating
                # 'div.quote' here would match nothing inside it
                'author_url': quote.css('span > a::attr(href)').extract_first(),
            }
            yield item

        # follow pagination link, if there is a next page
        next_page_url = response.css('li.next > a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
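To illustrate what the CSS selectors target without running Scrapy, here is a stdlib sketch that applies equivalent XPath lookups to a minimal quote block. The markup, author name, tags, and URL below are illustrative stand-ins, not values scraped from the site:

```python
import xml.etree.ElementTree as ET

# Hypothetical quote block mirroring the structure of quotes.toscrape.com
HTML = """<div class="quote">
  <span class="text">Quote text.</span>
  <span>by <small class="author">Jane Doe</small>
    <a href="/author/Jane-Doe">(about)</a>
  </span>
  <div class="tags">
    <a class="tag" href="/tag/x/">x</a>
    <a class="tag" href="/tag/y/">y</a>
  </div>
</div>"""

root = ET.fromstring(HTML)

# Each lookup corresponds to one CSS selector in the spider's parse()
item = {
    'author_name': root.find('.//small[@class="author"]').text,   # small.author::text
    'text': root.find('.//span[@class="text"]').text,             # span.text::text
    'tags': [a.text for a in
             root.findall('.//div[@class="tags"]/a[@class="tag"]')],  # div.tags a.tag::text
    'author_url': root.find('.//span/a').attrib['href'],          # span > a::attr(href)
}
print(item)
```

With Scrapy installed, the actual spider can be run and its items exported with scrapy runspider quotes_spider.py -o quotes.json (the script and output filenames here are illustrative).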
The source code and JSON output can be found in the project's GitHub repository.