A couple of weeks ago, I needed to parse parts of a web page whose main content is loaded by AJAX after the initial GET request finishes. This means that Python's urllib, urllib2 and requests packages will not download the same source that Chrome renders when you reach the site in a browser. These libraries return the response body as soon as the GET request completes, without waiting for any AJAX calls to finish, so the fetched code differs from the rendered code. In my case, I needed the rendered code. After a long investigation and asking on StackOverflow (nobody replied and the question was deleted), I ended up using Selenium to emulate normal browser behaviour. It was rather hard for me because I needed async flash content, and Selenium's default Firefox driver is not capable of rendering flash. I switched to the Chrome driver. This step just made things harder.
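To illustrate the gap, here is a minimal standard-library sketch (the HTML is hypothetical): a plain HTTP fetch sees only the placeholder markup, while the content a browser shows is injected later by JavaScript.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML as urllib/requests would see it: the target div is
# still empty because the JavaScript that fills it has not been executed.
RAW_HTML = """
<html><body>
  <div id="content"></div>
  <script>document.getElementById('content').innerHTML = 'loaded!';</script>
</body></html>
"""

class DivText(HTMLParser):
    """Collect the text inside <div id="content">."""
    def __init__(self):
        super().__init__()
        self.in_div = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("id", "content") in attrs:
            self.in_div = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_div = False

    def handle_data(self, data):
        if self.in_div:
            self.text += data

parser = DivText()
parser.feed(RAW_HTML)
print(repr(parser.text))  # the div is empty; 'loaded!' only exists after JS runs
```

A real browser (or Selenium driving one) executes the script, so its DOM contains the injected text that the raw fetch never sees.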
- Install the dependencies (Selenium, pyvirtualdisplay and Xvfb for headless operation):
sudo easy_install selenium
sudo easy_install pyvirtualdisplay
sudo apt-get install xvfb
- Firefox and PhantomJS are not capable of showing flash videos; Chrome was the only one that worked:
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
sudo sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
sudo apt-get update
sudo apt-get install google-chrome-stable
Download the Chrome driver from http://chromedriver.storage.googleapis.com/index.html
Make sure Chrome is installed at /usr/bin/google-chrome
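As a quick sanity check you can locate the browser binary from Python; a minimal sketch (`find_chrome` is a hypothetical helper, not part of Selenium):

```python
import shutil

def find_chrome():
    """Return the path to the Chrome binary, or None if it is not installed."""
    # ChromeDriver looks for the browser at /usr/bin/google-chrome by default,
    # so on Debian/Ubuntu the result should normally be that path.
    return shutil.which("google-chrome") or shutil.which("google-chrome-stable")

print(find_chrome())
```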
- Snippet (with various browser choices; uncomment the one you want to use and comment out the others):
from contextlib import closing
from selenium import webdriver
from selenium.webdriver import Firefox
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

#with closing(Firefox()) as browser:
#with closing(webdriver.PhantomJS()) as browser: # PhantomJS
#with closing(webdriver.Remote("http://localhost:9515", desired_capabilities=DesiredCapabilities.CHROME)) as browser: # Fake chrome.
# Don't know why it's not working with a relative path. Pass an absolute one!
with closing(webdriver.Chrome('/home/haku/chromedriver')) as browser: # Fake chrome.
    browser.get(url)  # url: the page you want to scrape
    page_source = browser.page_source
- Full code (including a small web.py server in order to parse a regexp from a site which has AJAX-loaded content): https://github.com/hakanu/selenium_scraper
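The parsing step itself is a plain regexp over `page_source`; a minimal sketch (the HTML and the pattern here are hypothetical, not from the scraped site):

```python
import re

# Hypothetical rendered source, standing in for browser.page_source.
page_source = """
<html><body>
  <div class="video" data-id="abc123">My video</div>
</body></html>
"""

# Pull the data-id attribute out of the rendered markup.
match = re.search(r'data-id="([^"]+)"', page_source)
if match:
    print(match.group(1))  # abc123
```

Since Selenium hands back the post-AJAX DOM, the regexp matches content that urllib would never have seen.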
Raspberry Pi does not have a Firefox package; instead there is Iceweasel, plus the Gnash plugin for flash:
sudo apt-get install iceweasel browser-plugin-gnash
In order to install Chromium (the closest thing to Chrome on a Raspberry Pi) for normal usage:
sudo apt-get install chromium-browser
However, these methods still didn't get my scraper through the page that I wanted to parse, because of the lack of flash support.