SUMMARY: The purpose of this project is to practice web scraping by extracting specific information from a website. The script then uses the extracted information to complete a follow-on task, in this case downloading files. The Python code uses the BeautifulSoup module.
INTRODUCTION: Occasionally, there is a need to download a batch of documents from web pages without clicking the download links one at a time. This web scraping script automatically traverses the necessary web pages, collects all links that point to PDF documents, and downloads those documents as part of the scraping process.
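The link-collection and download steps could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code: the function names `find_pdf_links` and `download_pdfs`, the `pdfs` target folder, and the rule that a PDF link is one whose URL path ends in `.pdf` are all hypothetical choices made for this example.

```python
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve

from bs4 import BeautifulSoup


def find_pdf_links(html, base_url):
    """Return absolute URLs of all links in the page that point to a PDF.

    Hypothetical helper: a link counts as a PDF when its URL path ends
    in ".pdf" (case-insensitive). Duplicates are dropped, order is kept.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs against the page they were found on.
        url = urljoin(base_url, anchor["href"])
        # Check only the path so a query string cannot hide the extension.
        if urlparse(url).path.lower().endswith(".pdf") and url not in links:
            links.append(url)
    return links


def download_pdfs(urls, target_dir="pdfs"):
    """Save each PDF into target_dir, named after the last URL path segment."""
    folder = Path(target_dir)
    folder.mkdir(parents=True, exist_ok=True)
    for url in urls:
        filename = Path(urlparse(url).path).name
        urlretrieve(url, str(folder / filename))
```

Keeping the parsing step separate from the download step makes it possible to inspect or save the list of collected links before any files are fetched.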
The script requires the Selenium browser automation software and one of its WebDrivers (Firefox in this case).
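As a rough illustration of the Selenium step, the sketch below loads a page in Firefox and returns the HTML as rendered by the browser. It assumes Selenium 4 with Firefox and a matching geckodriver available on the machine; the function name `fetch_rendered_html` and the headless default are hypothetical choices for this example, not details taken from the project.

```python
def fetch_rendered_html(url, headless=True):
    """Load url in Firefox via Selenium and return the rendered page HTML.

    Hypothetical helper: assumes Selenium 4, Firefox, and geckodriver
    are installed and discoverable on this machine.
    """
    # Imported inside the function so the module loads without Selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    if headless:
        # Run Firefox without opening a visible window.
        options.add_argument("-headless")

    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM after the browser has loaded the page.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

A call such as `fetch_rendered_html("https://docs.aws.amazon.com/")` would return the page HTML as a string, which can then be handed to an HTML parser.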
Starting URL: https://docs.aws.amazon.com/
The source code and JSON output can be found here on GitHub.