python - Scraping Multiple Websites with a Single Spider using Scrapy
I am using Scrapy to scrape data from this website. The following is the code for the spider:
```python
class StackItem(scrapy.Item):
    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = scrapy.Field()
        self._values[key] = value


class BetaSpider(CrawlSpider):
    name = "betaspider"

    def __init__(self, *args, **kwargs):
        super(BetaSpider, self).__init__(*args, **kwargs)
        # The start URL is passed in from the command line / caller
        self.start_urls = [kwargs.get('start_url')]

    rules = (
        Rule(LinkExtractor(unique=True, allow=(r'.*\?id1=.*',),
                           restrict_xpaths=('//a[@class="prevnext next"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        posts = response.xpath("//article[@class='classified']")
        for post in posts:
            item = StackItem()
            item["job_role"] = post.xpath("div[@class='uu mb2px']/a/strong/text()").extract()
            item["company"] = post.xpath("p[1]/text()").extract()
            item["location"] = post.xpath("p[@class='mb5px b red']/text()").extract()
            item["desc"] = post.xpath("details[@class='aj mb10px']/text()").extract()
            item["read_more"] = post.xpath("div[@class='uu mb2px']/a/@href").extract()
            yield item
```
This is the code for the item pipeline:
```python
import csv

class MyExporter(object):
    def __init__(self):
        self.mycsv = csv.writer(open('out.csv', 'wb'))
        self.mycsv.writerow(['job role', 'company', 'location',
                             'description', 'read more'])

    def process_item(self, item, spider):
        self.mycsv.writerow([item['job_role'], item['company'],
                             item['location'], item['desc'],
                             item['read_more']])
        return item
```
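Note that a pipeline like this only runs if it is registered in the project's settings. A minimal sketch of that registration, assuming the module path `myproject.pipelines` (the project name here is an assumption for illustration, not from the original post):

```python
# settings.py -- the module path "myproject.pipelines" is an assumed
# project layout; adjust it to wherever MyExporter actually lives.
ITEM_PIPELINES = {
    "myproject.pipelines.MyExporter": 300,  # lower number = runs earlier
}
```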
This is working fine. Now, I have to scrape the following websites (for example) using the same spider.
I have to scrape the tags of the above-mentioned websites and store them in a CSV file using item pipelines.
Actually, the list of websites to be scraped is endless. In this project, the user enters a URL and the scraped results are returned to the user. So, I want a generic spider that can scrape any website.
For a single website, this works fine. But how can this be accomplished for multiple sites having different structures? Is Scrapy enough to solve it?
A different spider per site structure is better. You can use the API to run Scrapy from a script, instead of the typical way of running `scrapy crawl`. Remember that Scrapy is built on top of Twisted, an asynchronous networking library, so you need to run it inside the Twisted reactor.