python - Scraping Multiple Websites with a Single Spider using Scrapy


I am using Scrapy to scrape data from this website. The following code is the spider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector


class StackItem(scrapy.Item):
    # Create fields on the fly so the item accepts arbitrary keys.
    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = scrapy.Field()
        self._values[key] = value


class BetaSpider(CrawlSpider):
    name = "betaspider"

    def __init__(self, *args, **kwargs):
        super(BetaSpider, self).__init__(*args, **kwargs)
        # The start URL is passed in as a spider argument.
        self.start_urls = [kwargs.get('start_url')]

    rules = (
        Rule(LinkExtractor(unique=True, allow=(r'.*\?id1=.*',),
                           restrict_xpaths=('//a[@class="prevnext next"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        posts = hxs.select("//article[@class='classified']")
        items = []
        for post in posts:
            item = StackItem()
            item["job_role"] = post.select("div[@class='uu mb2px']/a/strong/text()").extract()
            item["company"] = post.select("p[1]/text()").extract()
            item["location"] = post.select("p[@class='mb5px b red']/text()").extract()
            item["desc"] = post.select("details[@class='aj mb10px']/text()").extract()
            item["read_more"] = post.select("div[@class='uu mb2px']/a/@href").extract()
            items.append(item)
        for item in items:
            yield item
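Since the spider reads its start URL from a keyword argument, it can be launched for a given site with a spider argument on the command line, for example:

scrapy crawl betaspider -a start_url=http://www.freejobalert.com/government-jobs/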

This is the code for the item pipeline:

import csv


class MyExporter(object):

    def __init__(self):
        self.mycsv = csv.writer(open('out.csv', 'wb'))
        self.mycsv.writerow(['job role', 'company', 'location', 'description', 'read more'])

    def process_item(self, item, spider):
        self.mycsv.writerow([item['job_role'], item['company'], item['location'],
                             item['desc'], item['read_more']])
        return item
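Note that a pipeline only runs once it is registered in the project settings. A minimal sketch, assuming the pipeline lives in a module named myproject/pipelines.py (the module path is an assumption; adjust it to wherever MyExporter is actually defined):

# settings.py
ITEM_PIPELINES = {
    # value is the pipeline order (0-1000, lower runs first)
    'myproject.pipelines.MyExporter': 300,
}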

This is working fine. Now, I have to scrape the following websites (for example) using the same spider:

  1. http://www.freejobalert.com/government-jobs/
  2. https://www.sarkariexaam.com/

I have to scrape tags from the above-mentioned websites and store the data in a CSV file using item pipelines.

Actually, the list of websites to be scraped is endless. In this project, the user enters a URL, and the scraped results are returned to the user. So, I want a generic spider that can scrape any website.

For a single website, this is working fine. But how can this be accomplished for multiple sites having different structures? Is Scrapy enough to solve it?

A different spider for each site is better. You can use the API to run Scrapy from a script, instead of the typical way of running it via scrapy crawl. Remember that Scrapy is built on top of Twisted, an asynchronous networking library, so you need to run it inside the Twisted reactor.
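A minimal sketch of that approach, using Scrapy's CrawlerProcess (which starts the Twisted reactor for you). It reuses the spider's start_url argument; the import path for BetaSpider is an assumption and should match your project layout:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.betaspider import BetaSpider  # assumed module path

# get_project_settings() picks up settings.py (and hence ITEM_PIPELINES).
process = CrawlerProcess(get_project_settings())

# Queue one crawl per site; each gets its own start_url argument.
for url in ['http://www.freejobalert.com/government-jobs/',
            'https://www.sarkariexaam.com/']:
    process.crawl(BetaSpider, start_url=url)

process.start()  # blocks here until all queued crawls finish

Keep in mind that the XPath expressions in parse_items are specific to one site's markup, so sites with different structures would still need their own parsing logic (for example, one spider class per site, all driven from a script like this).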

