• New Technologies Like PhantomJS Added to Our Work

    We regularly experiment with new technologies and adopt the ones that prove efficient for our work. PhantomJS (along with a related module named “scraperjs”) was one such experiment for a long time, and we have now decided to use it on projects that its inclusion could improve. In this post, I will explain how it has made our web scraping much better than before. Not in all cases, but in many.

    What is PhantomJS?
    PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG. PhantomJS is an optimal solution for Headless Website Testing, Screen Capture, Page Automation, Network Monitoring, etc.

    Among its various features, we mainly used “Screen Capture” and “Page Automation” in our projects. The first helped us capture screenshots (thumbnails, etc.) as well as fully rendered HTML content. The latter helped us manipulate page content with the DOM API and jQuery. Page automation is really great because your script navigates like a real browser, and you can make it wait only until the content you need has loaded.

    PhantomJS has many settings for tailoring the crawling process to your requirements. You can set the user agent, cookies, the screen size for screenshots, and the timeout duration, inject your own JavaScript, and much more.

    To take a quick look, check this page – http://phantomjs.org/page-automation.html

    PHP Integration with PhantomJS:
    We use a lot of PHP in our work, so we needed a way to use PhantomJS's capabilities from our PHP scripts. We struggled initially, but we succeeded, and it is now producing great results. PHP integration is useful in many ways, especially when we need to track progress in a database, save results into the database, or combine results with others produced by more complex PHP methods.

    We mainly use PhantomJS when we need to scrape a page that depends so heavily on JavaScript and/or AJAX that the final result is not found in the page source. For example: a game scoreboard that changes every few seconds, or a very restricted website that keeps some parts in an IFrame.
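
    As an illustration only, here is a minimal sketch of one way PHP can drive PhantomJS: write a small PhantomJS script to a temporary file, run it through the shell, and read back the JSON it prints. This is not necessarily how our own integration works. The function names, the user agent string, the viewport size and the timeout are all assumptions chosen for the example, and the sketch expects a phantomjs binary on the PATH.

    ```php
    <?php
    // Hypothetical PhantomJS script, kept inline so the sketch is a single file.
    // It sets a few page options, opens the URL passed as the first argument
    // and prints one JSON object describing the rendered page.
    const RENDER_SCRIPT = '
    var page = require("webpage").create();
    var system = require("system");
    page.settings.userAgent = "Mozilla/5.0 (compatible; ExampleBot/1.0)";
    page.settings.resourceTimeout = 10000; // milliseconds
    page.viewportSize = { width: 1280, height: 800 };
    page.open(system.args[1], function (status) {
        if (status !== "success") {
            console.log(JSON.stringify({ ok: false }));
            phantom.exit(1);
            return;
        }
        var title = page.evaluate(function () { return document.title; });
        console.log(JSON.stringify({ ok: true, title: title, html: page.content }));
        phantom.exit(0);
    });
    ';

    /** Build the shell command; every argument is escaped. */
    function build_phantomjs_command(string $binary, string $scriptPath, string $url): string
    {
        return implode(' ', array_map('escapeshellarg', [$binary, $scriptPath, $url]));
    }

    /** Decode the JSON printed by the script; null when output is missing or invalid. */
    function parse_phantomjs_output($raw): ?array
    {
        if (!is_string($raw) || trim($raw) === '') {
            return null;
        }
        $data = json_decode(trim($raw), true);
        return is_array($data) ? $data : null;
    }

    /** Render $url with PhantomJS and return the decoded result, or null on failure. */
    function render_with_phantomjs(string $url, string $binary = 'phantomjs'): ?array
    {
        $scriptPath = sys_get_temp_dir() . DIRECTORY_SEPARATOR . uniqid('render_') . '.js';
        file_put_contents($scriptPath, RENDER_SCRIPT);
        try {
            $command = build_phantomjs_command($binary, $scriptPath, $url);
            return parse_phantomjs_output(shell_exec($command));
        } finally {
            unlink($scriptPath); // always remove the temporary script
        }
    }

    // Usage (requires PhantomJS to be installed):
    // $page = render_with_phantomjs('http://example.com/');
    // if ($page !== null && $page['ok']) {
    //     echo $page['title'], "\n"; // $page['html'] holds the rendered markup
    // }
    ```

    Once the rendered HTML is back in PHP as an array, it can be stored in a database or passed to other PHP code like any other value.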

    That’s it for PhantomJS. We will share more in the coming days. Please feel free to contact us for any of your web scraping needs.

  • Scraping and Parsing Google Search Results with the PHP Simple HTML DOM Library

    The title says a lot, but let us give you an overview of this post. For many months we have been planning to share our scraping knowledge with people interested in this field, because we feel it offers far more opportunities than there are real experts available. We have already trained quite a few scraping professionals in our region, but hands-on teaching can only reach so many; we want to help many more. Today’s topic is one of the most requested crawling tasks: scraping Google Search. This post is intended for people who already have some knowledge of PHP.

    To follow this post, please make sure you already have the PHP Simple HTML DOM library and have taken a look at its first page. Read the “Quick Start” block, which shows how to grab image links from a given page.

    In our example, we searched Google for the keyword “Beautiful Bangladesh” and parsed the names and associated links from the results. Here is the link we used to search for the keyword in Google:

    http://www.google.com/search?hl=en&safe=active&tbo=d&site=&source=hp&q=Beautiful+Bangladesh&oq=Beautiful+Bangladesh
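
    A search URL like the one above can also be assembled in code instead of being typed by hand. The following is a minimal sketch using PHP's built-in http_build_query. The function name build_search_url is hypothetical, and keeping only the hl, safe and q parameters from the link above is an assumption made for brevity.

    ```php
    <?php
    /**
     * Build a Google search URL for a keyword.
     * Only hl, safe and q are kept from the original link (an assumption).
     */
    function build_search_url(string $keyword): string
    {
        $params = [
            'hl'   => 'en',
            'safe' => 'active',
            'q'    => $keyword,
        ];
        // http_build_query URL-encodes the values; spaces become "+".
        return 'http://www.google.com/search?' . http_build_query($params, '', '&');
    }

    echo build_search_url('Beautiful Bangladesh'), "\n";
    ```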

    Then, we located the first result using the Firebug inspector, which gives us the path to the DOM element for the result.

    Scraping Google Search Results

    You can see that each result item is contained in an “h3” element of class “r”, under which the link is found. This can be easily captured by the HTML DOM Library as below:

    Code Screenshot

    Here, we traversed all the result links and extracted their titles and URLs. Usually the direct link is in the “href” attribute. But the links are not always direct: Google sometimes uses a different structure, with its own reference link that keeps the target URL in a parameter. In that case we used a regular expression to match and extract the URL. Please learn the basics of Perl Compatible Regular Expressions (PCRE) if you’re new to PHP.
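
    For readers who cannot view the screenshot, the loop and the regular expression described above might look roughly like the sketch below. It is a reconstruction under assumptions, not the exact script from the download. The function names are hypothetical. The “h3.r a” selector reflects the markup described in this post. The reference-link format “/url?q=<target>&...” is an assumption about what Google returned at the time.

    ```php
    <?php
    /**
     * Return the destination URL for a result link.
     * Direct links are returned unchanged. Reference links such as
     * "/url?q=http%3A%2F%2Fexample.com%2F&sa=U" (assumed format) are unwrapped.
     */
    function extract_target_url(string $href): string
    {
        // Attribute values may still be HTML-encoded ("&amp;" instead of "&").
        $href = html_entity_decode($href);
        $pattern = '~^(?:https?://www\.google\.[a-z.]+)?/url\?(?:[^#]*&)?q=([^&#]+)~i';
        if (preg_match($pattern, $href, $m)) {
            return urldecode($m[1]);
        }
        return $href;
    }

    /**
     * Collect title/link pairs from a parsed results page.
     * $html is expected to be a simple_html_dom object, for example the
     * return value of file_get_html() or str_get_html().
     */
    function collect_results($html): array
    {
        $results = [];
        foreach ($html->find('h3.r a') as $a) {
            $results[] = [
                'title' => trim(html_entity_decode($a->plaintext)),
                'link'  => extract_target_url($a->href),
            ];
        }
        return $results;
    }

    // Usage, assuming simple_html_dom.php sits next to this script:
    // require_once 'simple_html_dom.php';
    // $html = file_get_html('http://www.google.com/search?hl=en&q=Beautiful+Bangladesh');
    // if ($html) {
    //     foreach (collect_results($html) as $r) {
    //         echo $r['title'], ' => ', $r['link'], "\n";
    //     }
    // }
    ```

    Keeping the URL clean-up in its own function means the regular expression can be checked against sample links without fetching a live page.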

    Finally, my result looked like this (as of Jan 08, 2013):

    Google Search Results

    If you want the script as a file, please follow this link:
    http://blog.proscraper.com/wp-content/uploads/2013/01/google_search.zip

    For developers – please feel free to ask questions; we are eager to help you learn technologies related to web scraping.

    For webmasters – please contact us to develop your complex web scraping solution.

    Thanks for reading.