Can Scrapy Be Used To Scrape Dynamic Content From Websites That Are Using AJAX?


Answer :

Here is a simple example of Scrapy with an AJAX request. Let's look at the site rubin-kazan.ru.

All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, and so on).

When I look at the page source I can't see these messages, because the page loads them with AJAX. But with Firebug for Mozilla Firefox (or an equivalent tool in other browsers) I can analyze the HTTP request that generates the messages on the page.

The request doesn't reload the whole page, only the parts that contain messages. To trigger it, I click an arbitrary page number at the bottom of the page.

And I observe the HTTP request that is responsible for the message body.

Once it finishes, I analyze the headers of the request (note that I'll extract this URL from a var section of the page source; see the code below).

And the form data of the request (the HTTP method is POST).

The response content is a JSON file that contains all the information I'm looking for.

Now I have to put all this knowledge into Scrapy. Let's define the spider for this purpose:

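    import re

    import scrapy


    class spider(scrapy.Spider):  # BaseSpider in older Scrapy releases
        name = 'RubiGuesst'
        start_urls = ['http://www.rubin-kazan.ru/guestbook.html']

        def parse(self, response):
            # The AJAX endpoint is declared in a JavaScript variable in the page source
            url_list_gb_messages = re.search(r'url_list_gb_messages="(.*)"', response.text).group(1)
            page = 0  # assumption: 'page' was undefined in the original snippet
            yield scrapy.FormRequest('http://www.rubin-kazan.ru' + url_list_gb_messages,
                                     callback=self.RubiGuessItem,
                                     formdata={'page': str(page + 1), 'uid': ''})

        def RubiGuessItem(self, response):
            json_file = response.body  # raw JSON payload with the messages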

In the parse function I have the response for the first request. In RubiGuessItem I have the JSON file with all the information.
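The callback above stops at the raw body. Here is a minimal sketch of the next step, assuming the payload is a JSON object whose messages sit under a `messages` key with `author`, `date` and `text` fields. These key names are hypothetical, since the real response structure was only visible in the screenshot, so check them against the actual response in the browser's network panel.

```python
import json


def parse_messages(body):
    """Turn the raw JSON body of the AJAX response into a list of dicts.

    The key names ('messages', 'author', 'date', 'text') are assumptions
    made for illustration; replace them with the keys of the real payload.
    """
    if isinstance(body, bytes):
        body = body.decode('utf-8')
    payload = json.loads(body)
    messages = []
    for entry in payload.get('messages', []):
        messages.append({
            'author': entry.get('author'),
            'date': entry.get('date'),
            'text': entry.get('text'),
        })
    return messages


# Made-up sample payload, standing in for response.body
sample = b'{"messages": [{"author": "Ivan", "date": "2012-01-15", "text": "Hello"}]}'
for message in parse_messages(sample):
    print(message['author'], message['date'], message['text'])
```

Inside RubiGuessItem, the same function could be called as `parse_messages(response.body)` and each resulting dict yielded from the callback.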


WebKit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu->Tools->Developer Tools. The Network tab lets you see all the information about every request and response.

In the Network tab you can filter the requests down to XHR; these are the requests made by JavaScript code.

Tip: the log is cleared every time you load a page. The black dot button at the bottom of the panel preserves the log.

After analyzing the requests and responses, you can simulate these requests from your web crawler and extract the valuable data. In many cases it will be easier to get your data this way than by parsing HTML, because the data contains no presentation logic and is formatted to be consumed by JavaScript code.
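As a rough sketch of simulating such a request outside Scrapy, the snippet below builds a POST request with the standard library, using form fields like those you would read off the Network tab. The URL `http://example.com/ajax/messages` and the field names `page` and `uid` are placeholders for whatever the developer tools show for your target site. The `X-Requested-With` header is one that many JavaScript libraries send with AJAX calls; whether a given server checks it is an assumption to verify.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_xhr_request(url, form):
    """Build a POST request that mimics a browser's AJAX form submission."""
    data = urlencode(form).encode('ascii')
    headers = {
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded',
    }
    return Request(url, data=data, headers=headers, method='POST')


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body (performs network I/O)."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


# Placeholder URL and form fields; nothing is sent until fetch_json() is called
request = build_xhr_request('http://example.com/ajax/messages', {'page': '2', 'uid': ''})
print(request.get_method(), request.full_url, request.data)
```

Keeping request construction separate from the network call makes it easy to compare the built request against what the browser sent before actually hitting the server.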

Firefox has a similar extension called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of WebKit.


Many times when crawling we run into problems where content rendered on the page is generated with JavaScript, and therefore Scrapy is unable to crawl it (e.g. AJAX requests, jQuery craziness).

However, if you use Scrapy along with the web testing framework Selenium, you can crawl anything displayed in a normal web browser.

Some things to note:

  • You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly.

  • This is just a template crawler. You could get much more advanced with it, but I only wanted to show the basic idea.

  • As the code stands, you will make two requests for any given URL: one by Scrapy and one by Selenium. There are probably ways to have Selenium make the only request, but I did not bother to implement that, and with two requests you get to crawl the page with Scrapy too.

  • This is quite powerful because you now have the entire rendered DOM available to crawl and can still use all the nice crawling features in Scrapy. It makes for slower crawling, of course, but depending on how much you need the rendered DOM it might be worth the wait.

    # Legacy APIs: scrapy.contrib and the Selenium RC client as they existed in 2011
    import time

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from scrapy.item import Item

    from selenium import selenium


    class SeleniumSpider(CrawlSpider):
        name = "SeleniumSpider"
        start_urls = ["http://www.domain.com"]

        rules = (
            Rule(SgmlLinkExtractor(allow=(r'\.html', )), callback='parse_page', follow=True),
        )

        def __init__(self):
            CrawlSpider.__init__(self)
            self.verificationErrors = []
            self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
            self.selenium.start()

        def __del__(self):
            self.selenium.stop()
            print(self.verificationErrors)
            CrawlSpider.__del__(self)

        def parse_page(self, response):
            item = Item()

            hxs = HtmlXPathSelector(response)
            # Do some XPath selection with Scrapy
            hxs.select('//div').extract()

            sel = self.selenium
            sel.open(response.url)

            # Wait for JavaScript to load in Selenium
            time.sleep(2.5)

            # Do some crawling of JavaScript-created content with Selenium
            sel.get_text("//div")
            yield item

    # Snippet imported from snippets.scrapy.org (which no longer works)
    # author: wynbennett
    # date  : Jun 21, 2011

Reference: http://snipplr.com/view/66998/
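The fixed `time.sleep(2.5)` in the snippet either wastes time on fast pages or is too short on slow ones. One alternative, sketched here as a plain polling helper and not taken from the original snippet, is to retry a check until it succeeds or a timeout expires. The helper name `wait_until` and the `content_ready` stand-in are hypothetical; in a real crawler the condition would ask the browser whether the expected element exists yet.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll condition() until it returns a truthy value or the timeout elapses.

    Returns the truthy value, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for "has the JavaScript finished rendering?": ready on the third check
attempts = []


def content_ready():
    attempts.append(1)
    return len(attempts) >= 3


print(wait_until(content_ready, timeout=2.0, interval=0.01))
```

With the Selenium RC client from the snippet this might look like `wait_until(lambda: sel.is_element_present('//div'))`, assuming that method is available in the installed client.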

