Python 2.7 - Need help extracting links from a TD in a webpage
I am new to Python and am trying my hand at building small web crawlers. I am trying to write a program in Python 2.7 with BeautifulSoup that extracts the profile URLs from this page and its subsequent pages:
http://www.bda-findadentist.org.uk/pagination.php?limit=50&page=1
Here I am trying to scrape the URLs that link to a details page, such as this one:
http://www.bda-findadentist.org.uk/practice_details.php?practice_id=6034&no=61881
However, I am lost as to how to make my program recognize these URLs. They are not within a div with a class or id; rather, they are encapsulated within a td tag that has a bgcolor attribute:
<td bgcolor="e7f3f1"><a href="practice_details.php?practice_id=6034&no=61881">view details</a></td>
Please advise on how I can make my program identify these URLs and scrape them. I tried the following, but none of them worked:
    for link in soup.select('td bgcolor=e7f3f1 a'):
    for link in soup.select('td#bgcolor#e7f3f1 a'):
    for link in soup.findAll('a[practice_id=*]'):
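A note on why these attempts probably fail: bgcolor is an attribute, not an id or a class, so the `#` syntax does not apply, and findAll expects a tag name plus attribute filters rather than a CSS selector. The following is a minimal sketch, assuming the cell markup is exactly as shown above. The sample HTML string (including the second, unrelated cell) and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the cell from the question plus one unrelated cell.
html = '''
<table><tr>
<td bgcolor="e7f3f1"><a href="practice_details.php?practice_id=6034&no=61881">view details</a></td>
<td><a href="other.php">something else</a></td>
</tr></table>
'''

soup = BeautifulSoup(html, 'html.parser')

# CSS attribute selector: <a> tags inside any <td> whose bgcolor is e7f3f1.
via_select = [a.get('href') for a in soup.select('td[bgcolor="e7f3f1"] a')]

# The same idea with find_all: filter <td> by attribute, then collect its <a> tags.
via_find_all = [
    a.get('href')
    for td in soup.find_all('td', bgcolor='e7f3f1')
    for a in td.find_all('a')
]

print(via_select)
print(via_find_all)
```

Both lists should contain only the details link, since the unrelated cell has no bgcolor attribute.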
My full program follows:
    import requests
    from bs4 import BeautifulSoup

    def bda_crawler(pages):
        page = 1
        while page <= pages:
            url = 'http://www.bda-findadentist.org.uk/pagination.php?limit=50&page=' + str(page)
            code = requests.get(url)
            text = code.text
            soup = BeautifulSoup(text)
            for link in soup.findAll('a[practice_id=*]'):
                href = "http://www.bda-findadentist.org.uk" + link.get('href')
                print (href)
            page += 1

    bda_crawler(2)
Please help.
Many thanks.
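Two things in the program above look likely to cause trouble: findAll is given a CSS-style string where it expects a tag name, and the base URL is concatenated with the relative href without a slash in between. One possible revision is sketched below. It assumes every details link has an href beginning with practice_details.php (as in the example cell), and the names BASE_URL and extract_detail_urls are introduced here for illustration only. The try/except import is there so the same code runs on both Python 2.7 and Python 3.

```python
import requests
from bs4 import BeautifulSoup

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7

BASE_URL = 'http://www.bda-findadentist.org.uk/pagination.php?limit=50&page='


def extract_detail_urls(html, page_url):
    """Return absolute URLs for every practice_details.php link in html."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumption: all details links start with "practice_details.php".
    links = soup.select('a[href^="practice_details.php"]')
    # urljoin resolves each relative href against the page it was found on.
    return [urljoin(page_url, a.get('href')) for a in links]


def bda_crawler(pages):
    """Print the details-page URLs found on result pages 1..pages."""
    for page in range(1, pages + 1):
        url = BASE_URL + str(page)
        response = requests.get(url)
        for href in extract_detail_urls(response.text, url):
            print(href)


# bda_crawler(2)  # uncomment to fetch the first two result pages
```

Keeping the parsing in its own function means it can be checked against a saved copy of a page without making any network requests.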