A couple of weeks ago, I needed to parse some parts of a web page whose main content is loaded only after the initial GET request finishes. This means that Python's urllib, urllib2 and requests packages will fail to download the same source that Chrome renders when you reach the site via the browser (rather than programmatically). Because these URL libraries return the content as soon as the GET request completes, they don't wait for the AJAX calls to finish, which causes a discrepancy between the fetched code and the rendered code. In my case, I needed the rendered code. After a long investigation and asking on StackOverflow (nobody replied and the question was deleted), I ended up using Selenium to emulate normal browser behaviour. It was still tricky because I needed asynchronously loaded Flash content, and Selenium's default Firefox driver is not capable of rendering Flash. I switched to the Chrome driver, and that step just made things harder.
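To illustrate the problem, here is a minimal sketch (the URL and the element name are placeholders, not from my actual scraper) showing that a plain requests call only returns the initial HTML, before any AJAX-generated content exists:

import requests

# Hypothetical page whose main content is filled in by JavaScript/AJAX
# after the initial HTML arrives.
url = 'http://example.com/ajax-heavy-page'

# requests returns as soon as the GET request finishes; no JavaScript runs,
# so anything the page loads afterwards is simply not in this string.
initial_html = requests.get(url).text

# An element that is only created by the AJAX call will not be found here,
# even though you can see it in Chrome's "Inspect element" view.
print('video-container' in initial_html)  # likely False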

Prereqs

  • Install PhantomJS
  • sudo easy_install selenium
  • sudo easy_install pyvirtualdisplay (wraps Xvfb so the browser can run on a machine without a display; see the sketch after this list)
  • sudo apt-get install xvfb
  • Firefox and PhantomJS are not capable of showing Flash videos; Chrome is the only one that worked for me.
  • Install chrome

  • wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
  • sudo sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
  • sudo apt-get update
  • sudo apt-get install google-chrome-stable
  • Chrome driver: http://chromedriver.storage.googleapis.com/index.html

  • wget http://chromedriver.storage.googleapis.com/2.10/chromedriver_linux64.zip
  • unzip chromedriver_linux64.zip

  • Make sure Chrome is installed at /usr/bin/google-chrome

  • python selenium_scraper.py

  • Snippet (with various browser choices; uncomment the one you want to use and comment out the others):

from contextlib import closing
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Use a browser to get the page with its JavaScript-generated content.
# Uncomment the driver you want and comment out the others:
#with closing(webdriver.Firefox()) as browser:
#with closing(webdriver.Chrome()) as browser:
# PhantomJS / chromedriver running as a remote WebDriver server:
#with closing(webdriver.Remote("http://localhost:9515", desired_capabilities=DesiredCapabilities.CHROME)) as browser:
# Don't know why it's not working with a relative path. Pass the absolute path!
with closing(webdriver.Chrome('/home/haku/chromedriver')) as browser:
    browser.get(url)
    # Wait until the AJAX-loaded element is present (up to 30 seconds).
    WebDriverWait(browser, timeout=30).until(
        EC.presence_of_element_located((By.NAME, element_name_to_check)))
    page_source = browser.page_source

  • Full code (including a small web.py server that extracts a regexp match from a site with AJAX-loaded content): https://github.com/hakanu/selenium_scraper
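The prerequisites above install pyvirtualdisplay and Xvfb, but the snippet never uses them, so here is a minimal sketch of how they can wrap the same scraping code on a headless server (no monitor attached); the URL and element name are placeholders:

from contextlib import closing

from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Placeholders - use your own target page and element.
url = 'http://example.com/ajax-heavy-page'
element_name_to_check = 'player'

# Start a virtual framebuffer (Xvfb) so Chrome has a display to draw on,
# even on a server with no X session running.
display = Display(visible=0, size=(1024, 768))
display.start()
try:
    with closing(webdriver.Chrome('/home/haku/chromedriver')) as browser:
        browser.get(url)
        WebDriverWait(browser, timeout=30).until(
            EC.presence_of_element_located((By.NAME, element_name_to_check)))
        page_source = browser.page_source
finally:
    display.stop()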

Appendix:

  • http://stackoverflow.com/questions/19015870/using-with-python-selenium-webdriverwait-in-pysaunter-for-async-pages
  • http://selenium-python.readthedocs.org/en/latest/waits.html
  • http://stackoverflow.com/questions/7593611/selenium-testing-without-browser
  • http://selenium-python.readthedocs.org/en/latest/locating-elements.html
  • https://code.google.com/p/selenium/wiki/ChromeDriver
  • https://code.google.com/p/selenium/wiki/PythonBindings
  • http://selenium-python.readthedocs.org/en/latest/getting-started.html
  • http://stackoverflow.com/questions/22558077/unknown-error-chrome-failed-to-start-exited-abnormally-driver-info-chromedri
  • http://stackoverflow.com/questions/22424737/message-uunknown-error-chrome-failed-to-start-exited-abnormally

Raspberry Pi

Raspberry Pi does not have a Firefox package; instead there is Iceweasel, plus the Gnash plugin for Flash: sudo apt-get install iceweasel browser-plugin-gnash

In order to install Chrome (well, Chromium) on the Raspberry Pi for normal usage: sudo apt-get install chromium-browser
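If you want to point Selenium at Iceweasel on the Pi, a minimal sketch looks like the following; it assumes the default /usr/bin/iceweasel binary path and a placeholder URL, and (as noted below) it still does not solve the Flash problem:

from contextlib import closing

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

# Iceweasel is Debian's rebranded Firefox, so the Firefox driver can drive it
# as long as it is told where the binary lives.
iceweasel = FirefoxBinary('/usr/bin/iceweasel')

with closing(webdriver.Firefox(firefox_binary=iceweasel)) as browser:
    browser.get('http://example.com/ajax-heavy-page')  # placeholder URL
    page_source = browser.page_source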

However, neither of these methods got my scraper through the page I wanted to parse, because of the missing Flash support.