Download every PDF linked from a page using Python.
I wanted to download every agenda posted on the Detroit City Council website, but they were in different folders.
Happily, there’s one page that lists all of them, so I wrote this short script:
import urllib2
import re
from BeautifulSoup import BeautifulSoup, SoupStrainer
import os
import time
# define the URL where all the links are:
url = "http://www.detroitmi.gov/legislative/CityClerk/2009add_cal.htm"
base_url = "http://www.detroitmi.gov/legislative/CityClerk/"
html = urllib2.urlopen(url).read()
# only select links with 'pdf' in the href
pdf_links = SoupStrainer('a', href=re.compile('pdf'))
soup = BeautifulSoup(html, parseOnlyThese = pdf_links)
for link in soup:
    link = base_url + link['href'] # build the full path to the PDF
    os.system("wget " + link)
    time.sleep(10) # wait a little while to be courteous
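The script above is Python 2: urllib2 does not exist in Python 3, and the "from BeautifulSoup import" form belongs to the old BeautifulSoup 3 package. For anyone on Python 3, here is a rough sketch of the same idea using only the standard library. It is not the original script. The names PdfLinkParser, extract_pdf_links, and download_all, and the "agendas" output folder, are made up for illustration. It also differs in two ways: it keeps only links whose path ends in ".pdf" (the original matches 'pdf' anywhere in the href), and it downloads with urllib instead of shelling out to wget.

```python
import time
import urllib.request
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect the absolute URL of every <a href> whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # urljoin resolves relative hrefs against the page URL,
        # so no separate hand-built base_url is needed.
        full_url = urljoin(self.base_url, href)
        if urlparse(full_url).path.lower().endswith(".pdf"):
            self.links.append(full_url)


def extract_pdf_links(html, base_url):
    """Return the PDF links found in an HTML string, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def download_all(page_url, dest_dir="agendas", delay=10):
    """Fetch page_url and save every linked PDF into dest_dir (hypothetical name)."""
    with urllib.request.urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for link in extract_pdf_links(html, page_url):
        filename = Path(urlparse(link).path).name
        urllib.request.urlretrieve(link, str(dest / filename))
        time.sleep(delay)  # wait a little while to be courteous
```

To use it, call download_all with the URL of the listing page, for example the 2009add_cal.htm address from the script above. Keeping the link extraction in its own function means it can be checked against a saved copy of the page without making any requests.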
Meg wrote:
Matt. YES.
Posted on 26-Jan-10 at 10:05 pm