Web Scraping of Scrapinghub Blog Entries using Python

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping code was written in Python 3 and leveraged the Scrapy framework maintained by Scrapinghub.

INTRODUCTION: Scrapinghub, the maker of the Scrapy framework, hosts its blog at blog.scrapinghub.com. The purpose of this exercise is to practice web scraping with Scrapy by gathering the blog entries from Scrapinghub. The script also automatically traverses from one page of blog entries to the next.

Starting URLs: https://blog.scrapinghub.com/

import scrapy

class ScrapinghubSpider(scrapy.Spider):
    name = 'scrapinghub'
    allowed_domains = ['scrapinghub.com']
    start_urls = ['https://blog.scrapinghub.com/']

    def parse(self, response):
        self.log('I just visited: ' + response.url)
        for blog in response.css('div.post-item'):
            item = {
                'blog_title': blog.css('div.post-header > h2 > a::text').extract_first(),
                'blog_url': blog.css('div.post-header > h2 > a::attr(href)').extract_first(),
                'date': blog.css('div.post-header > div.byline > span.date > a::text').extract_first(),
                'author': blog.css('div.post-header > div.byline > span.author > a::text').extract_first(),
                'summary': blog.css('div.post-content > p::text').extract_first(),
            }
            yield item

        # follow pagination link
        next_page_url = response.css('div.blog-pagination > a.next-posts-link::attr(href)').extract_first()
        if next_page_url:
            self.log('Moving on to next page: ' + next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
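Scrapy handles the request scheduling and downloading, but the pagination pattern above boils down to a simple loop: extract the items on the current page, then follow the next-page link until none remains. A minimal offline sketch of that loop (the page contents and URLs below are made-up stand-ins for real responses, not actual blog data):

```python
# Offline sketch of the crawl loop Scrapy runs for us.
# PAGES maps a URL to (items found on that page, next-page URL or None);
# the URLs and titles here are hypothetical placeholders.
PAGES = {
    'https://blog.scrapinghub.com/': (['Post A', 'Post B'], 'https://blog.scrapinghub.com/page/2'),
    'https://blog.scrapinghub.com/page/2': (['Post C'], None),
}

def crawl(url):
    """Yield every item, following next-page links until exhausted."""
    while url:
        items, url = PAGES[url]  # "parse" the response and find the next link
        yield from items

print(list(crawl('https://blog.scrapinghub.com/')))  # ['Post A', 'Post B', 'Post C']
```

In the real spider, yielding a scrapy.Request with callback=self.parse plays the role of the loop step: Scrapy queues the next page and calls parse on its response, so the same extraction code runs on every page.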

The source code and JSON output are available on GitHub.