Web Scraping of Data.gov Dataset Catalog Using Python and Selenium

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The scraping code is written in Python and uses the Selenium module.

INTRODUCTION: Data.gov is a government data repository managed and hosted by the U.S. General Services Administration. The purpose of this exercise is to practice web scraping by gathering the dataset entries from Data.gov’s catalog pages. This iteration of the script automatically traverses the catalog pages, captures all dataset entries, and stores the captured information in a JSON output file.

Starting URL: https://catalog.data.gov/dataset
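
To make the traverse-capture-store flow concrete, here is a minimal sketch of how such a script might be organized. It is not the project's actual code. The `?page=N` query parameter, the `h3.dataset-heading a` CSS selector, the `max_pages` limit, the output file name `datasets.json`, and the helper names `page_url`, `save_entries`, and `scrape_catalog` are all assumptions made for illustration; the real page markup should be inspected before relying on any of them. The sketch also assumes Selenium 4 with a local Chrome installation.

```python
import json
from pathlib import Path

START_URL = "https://catalog.data.gov/dataset"


def page_url(page_number):
    """Build the URL of one catalog page (assumes a ?page=N query parameter)."""
    return f"{START_URL}?page={page_number}"


def save_entries(entries, path):
    """Write the captured entries (a list of dicts) to a JSON file."""
    Path(path).write_text(json.dumps(entries, indent=2), encoding="utf-8")


def scrape_catalog(max_pages=3):
    """Visit up to max_pages catalog pages and collect dataset titles and links."""
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    entries = []
    try:
        for page in range(1, max_pages + 1):
            driver.get(page_url(page))
            # Assumed selector: each dataset title is a link inside an
            # <h3 class="dataset-heading"> element.
            links = driver.find_elements(By.CSS_SELECTOR, "h3.dataset-heading a")
            if not links:
                break  # no entries found, so stop traversing
            for link in links:
                entries.append(
                    {"title": link.text.strip(), "url": link.get_attribute("href")}
                )
    finally:
        driver.quit()  # always release the browser, even on errors
    return entries


if __name__ == "__main__":
    save_entries(scrape_catalog(), "datasets.json")
```

Keeping the URL building and JSON writing in small separate functions means those parts can be checked without launching a browser, while `scrape_catalog` holds everything that depends on Selenium and the live site.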

The source code and HTML output can be found here on GitHub.