March 4th, 2016
We are always experimenting with new technologies, and we adopt the ones that repeatedly prove efficient for our work. PhantomJS (along with a related module named “scraperjs”) was one such experiment for a long time, until we decided to use it in projects that its inclusion could improve. In this post, I will explain how it has made our web scraping much better than before. Not in all cases, but in many.
What is PhantomJS?
PhantomJS is a headless WebKit browser that is scriptable through a JavaScript API. Among its various features, we mainly used “Screen Capture” and “Page Automation” in our projects. The first helped us capture screenshots (thumbnails, etc.) as well as fully rendered HTML content. The second helped us manipulate page content with the DOM API and jQuery. “Screen Capture” is especially useful because your script navigates like a real browser, and you can make it wait only until the content you need has loaded.
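To make the "wait only until the content you need has loaded" idea concrete, here is a minimal sketch of a polling helper combined with a screenshot and an HTML dump. It is one possible approach, not necessarily the one we use in production. The URL, the `#content` selector, the output file name, and the `waitFor` helper itself are assumptions made for illustration; `page.open`, `page.evaluate`, `page.render`, `page.content`, and `phantom.exit` are part of the PhantomJS API.

```javascript
// Poll until condition() returns true, then call onReady().
// If timeoutMs elapses first, call onTimeout() instead.
// (Hypothetical helper, written in ES5 so PhantomJS can run it.)
function waitFor(condition, onReady, onTimeout, timeoutMs, intervalMs) {
  var start = Date.now();
  (function poll() {
    if (condition()) {
      onReady();
      return;
    }
    if (Date.now() - start >= timeoutMs) {
      onTimeout();
      return;
    }
    setTimeout(poll, intervalMs);
  })();
}

// The block below only runs inside PhantomJS, where the global
// `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var url = 'http://example.com/'; // placeholder URL

  page.open(url, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + url);
      phantom.exit(1);
      return;
    }
    waitFor(
      function () {
        // Runs inside the page; '#content' is an assumed selector.
        return page.evaluate(function () {
          return document.querySelector('#content') !== null;
        });
      },
      function () {
        page.render('screenshot.png'); // save a screenshot of the page
        console.log(page.content);     // print the fully rendered HTML
        phantom.exit(0);
      },
      function () {
        console.log('Timed out waiting for #content');
        phantom.exit(1);
      },
      10000, // give up after 10 seconds
      250    // check every 250 ms
    );
  });
}
```

Saved as, say, `capture.js`, this would be run with `phantomjs capture.js`. Polling for a specific element is usually more reliable than a fixed delay, because the script continues as soon as the content appears.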
PhantomJS has many settings for customizing the crawling process. You can set the user agent, cookies, the screen size for screenshots, and the timeout duration; you can also inject your own JS script, among many other things.
To take a quick look, check this page – http://phantomjs.org/page-automation.html
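As a rough illustration of the settings mentioned above, the sketch below applies a user agent, a resource timeout, and a viewport size to a page, then adds a cookie and injects a script. The `configurePage` helper, the option names, the cookie values, the URL, and the `helper.js` file are all hypothetical; `page.settings.userAgent`, `page.settings.resourceTimeout`, `page.viewportSize`, `phantom.addCookie`, and `page.injectJs` are PhantomJS API members.

```javascript
// Apply a few common crawl settings to a PhantomJS page object.
// (Hypothetical helper; the option names are our own.)
function configurePage(page, options) {
  page.settings = page.settings || {};
  page.settings.userAgent = options.userAgent;
  page.settings.resourceTimeout = options.timeoutMs; // milliseconds
  page.viewportSize = { width: options.width, height: options.height };
  return page;
}

// The block below only runs inside PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = configurePage(require('webpage').create(), {
    userAgent: 'Mozilla/5.0 (compatible; ExampleCrawler/1.0)', // placeholder
    timeoutMs: 5000,
    width: 1280,
    height: 800
  });

  // Cookie values here are placeholders.
  phantom.addCookie({
    name: 'session',
    value: 'abc123',
    domain: 'example.com',
    path: '/'
  });

  page.open('http://example.com/', function (status) {
    if (status === 'success') {
      // injectJs returns false if the file could not be loaded.
      if (!page.injectJs('helper.js')) {
        console.log('Could not inject helper.js');
      }
      page.render('configured.png');
    }
    phantom.exit(status === 'success' ? 0 : 1);
  });
}
```

Keeping the settings in one helper makes it easy to reuse the same crawl profile across several scripts.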
PHP Integration with PhantomJS:
We use a lot of PHP in our work, so we needed a way to use PhantomJS capabilities from our PHP scripts. We struggled initially, but we succeeded, and it is now producing excellent results. PHP integration is useful in many ways, especially when we need to track progress in a database, save results to it, or combine results with others produced by more complex PHP methods.
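One simple way to connect the two, shown here only as a sketch and not as a description of our actual setup, is to have the PhantomJS script take a URL as a command-line argument and print a single JSON document to standard output. A PHP script could then run it with something like `shell_exec` and parse the output with `json_decode`. The `buildResult` helper and the field names are assumptions for illustration; `system.args`, `page.open`, `page.evaluate`, and `page.content` are part of the PhantomJS API.

```javascript
// Package the scrape result as one JSON string for the calling process.
// (Hypothetical helper; the field names are our own.)
function buildResult(url, status, title, html) {
  return JSON.stringify({
    url: url,
    status: status,
    title: title,
    html: html
  });
}

// The block below only runs inside PhantomJS.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var target = system.args[1]; // args[0] is the script name

  if (!target) {
    console.log(buildResult('', 'error: no URL given', '', ''));
    phantom.exit(1);
  } else {
    var page = require('webpage').create();
    page.open(target, function (status) {
      var title = page.evaluate(function () {
        return document.title;
      });
      console.log(buildResult(target, status, title, page.content));
      phantom.exit(status === 'success' ? 0 : 1);
    });
  }
}
```

Saved as, say, `fetch.js`, it would be called as `phantomjs fetch.js http://example.com/`. Because the output is plain JSON, the PHP side can store it in a database or merge it with results from other PHP methods without knowing anything about PhantomJS internals.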
That’s it for PhantomJS. We will share more in the coming days. Please feel free to contact us for any of your web scraping needs.