Download every PDF linked from a page using Python.

I wanted to download every agenda posted on the Detroit City Council website, but they were in different folders.

Happily, there’s one page that lists all of them, so I wrote this short script:

# Python 2, using BeautifulSoup 3
import re
import subprocess
import time
import urllib2
from urlparse import urljoin

from BeautifulSoup import BeautifulSoup, SoupStrainer

# define the URL where all the links are:
url = "http://www.detroitmi.gov/legislative/CityClerk/2009add_cal.htm"
base_url = "http://www.detroitmi.gov/legislative/CityClerk/"
html = urllib2.urlopen(url).read()

# only select links with 'pdf' in the href
pdf_links = SoupStrainer('a', href=re.compile('pdf'))
soup = BeautifulSoup(html, parseOnlyThese=pdf_links)

for link in soup:
    # build the full URL to the PDF; urljoin also copes with absolute hrefs
    pdf_url = urljoin(base_url, link['href'])
    # pass arguments as a list so odd characters in the URL never reach a shell
    subprocess.call(["wget", pdf_url])
    time.sleep(10)  # wait a little while to be courteous
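
For anyone on Python 3, where urllib2 no longer exists, here is a rough sketch of the same idea using only the standard library. It is not the script I actually ran: the names PdfLinkParser, extract_pdf_links, and download_all are made up for illustration, and it matches hrefs whose path ends in ".pdf" rather than any href containing "pdf".

```python
import os
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect the full URL of every <a> tag whose href path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and urlparse(href).path.lower().endswith(".pdf"):
            # urljoin resolves relative, root-relative, and absolute hrefs
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Return the absolute URLs of all PDF links found in an HTML string."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def download_all(page_url, dest_dir=".", delay=10):
    """Fetch page_url and save every linked PDF into dest_dir."""
    with urllib.request.urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    for pdf_url in extract_pdf_links(html, page_url):
        filename = os.path.basename(urlparse(pdf_url).path)
        urllib.request.urlretrieve(pdf_url, os.path.join(dest_dir, filename))
        time.sleep(delay)  # wait a little while to be courteous
```

Calling download_all(url) with the listing page's address would save the files into the current directory. One caveat of this sketch: files are saved under their base name only, so two PDFs with the same name in different folders would overwrite each other.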

Comments (1) left to “Download every PDF linked from a page using Python.”

  1. Meg wrote:

    Matt. YES.
