• Launching of Our Data Marketplace for Professional Data Services

    For the last few months, we have been preparing a marketplace of our own to sell our existing datasets and databases. It is finally ready, launching with about 45 datasets. You can visit the marketplace here: http://data.proscraper.com

    (Screenshot: Data Marketplace)

    While we develop more databases for this platform, drawing on our decade of experience, we currently offer data items in the following categories:
    - Amazon
    - Business Listing
    - Clothing & Accessories
    - Educational
    - Financial
    - Locations
    - Real Estate
    - Shopping
    - Yellow Pages
    - Food & Dining
    - Entertainment
    - Health & Medicine
    - Professionals
    - Service Providers
    - Sports

    However, we are not limited to these categories, or even to the listed data items. You can work with us to customize a dataset to your needs.

    The detail page of each data item shows the data count, data fields, and so on, along with an option to download a sample CSV file.

    (Screenshot: Sample Data Alexa Top Sites)

    The screenshot above is from the page of our ‘Alexa Top Sites’ data item. The number in ‘Data Count’ is customizable (an extra charge may or may not apply); the one shown is a size that customers often ask for.

    We will post more updates about our ‘Data Marketplace’ here as they happen. Suggestions that could help us improve it are welcome.

  • Our Recent Experiences with Price Comparison Tools

    We have been developing price comparison tools for many years, but have never shared our experiences here or on any of our other blogs. This year we developed two different kinds of price comparison tools for two different customers, each with highly specific requirements.

    I will share both experiences here today. The first customer wanted a comparison between different versions of the same products from the same source. They needed to track changes in the price and some other attributes (such as colors and sizes) of each product on their competitors' websites. This let them set their own prices for the corresponding items in their store. We also created an email template that notifies them of the changed products every morning. We developed a very simple web panel for this customer, as they told us the data mattered more than the view.
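
    As an illustration only, the change tracking described above could be sketched in Python as follows. This is a minimal sketch, not the tool we delivered: the function name `diff_snapshots`, the field names in `TRACKED_FIELDS`, and the sample products are all assumptions made up for the example. It compares yesterday's scraped snapshot with today's and keeps only the products whose tracked fields changed, which is the kind of list a morning email could be built from.

```python
from typing import Any

# Attributes compared between snapshots; these names are illustrative assumptions.
TRACKED_FIELDS = ("price", "colors", "sizes")


def diff_snapshots(
    previous: dict[str, dict[str, Any]],
    current: dict[str, dict[str, Any]],
) -> list[dict[str, Any]]:
    """Return one record per product whose tracked fields changed.

    Both arguments map a product ID to its scraped attributes.
    Products that appear in only one snapshot are ignored in this sketch.
    """
    changes = []
    for product_id in sorted(previous.keys() & current.keys()):
        old, new = previous[product_id], current[product_id]
        # Keep (old value, new value) pairs only for fields that differ.
        changed = {
            field: (old.get(field), new.get(field))
            for field in TRACKED_FIELDS
            if old.get(field) != new.get(field)
        }
        if changed:
            changes.append({"product_id": product_id, "changed": changed})
    return changes


# Hypothetical sample data standing in for two daily scrapes.
yesterday = {
    "sku-1": {"price": 19.99, "colors": ["red", "blue"], "sizes": ["M", "L"]},
    "sku-2": {"price": 45.00, "colors": ["black"], "sizes": ["S"]},
}
today = {
    "sku-1": {"price": 17.49, "colors": ["red", "blue"], "sizes": ["M", "L"]},
    "sku-2": {"price": 45.00, "colors": ["black"], "sizes": ["S"]},
}

for change in diff_snapshots(yesterday, today):
    print(change["product_id"], change["changed"])
```

    Storing each day's snapshot and diffing against the previous one keeps the comparison independent of how the data was scraped.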

    Price Comparison Simple Panel View

    The other experience involved golf pricing comparison. The client wanted to launch an application that shows golf club owners how the pricing of their own club's course compares with that of some nearby courses, which the owners choose or configure themselves. For active subscribers, the system prepares reports based on their choices and emails them regularly in the morning. This lets them see their competitors' prices at a glance in a single window each morning.
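
    The comparison at the heart of such a report can be sketched roughly like this in Python. It is only an illustration under assumptions: the function `compare_course_price`, the report fields, and the course names and prices are hypothetical and not taken from the client's application.

```python
from statistics import mean


def compare_course_price(own_price: float, competitor_prices: dict[str, float]) -> dict:
    """Summarize how one course's price sits against the selected competitors."""
    if not competitor_prices:
        raise ValueError("at least one competitor price is required")
    average = mean(competitor_prices.values())
    # Name of the competitor with the lowest price.
    cheapest = min(competitor_prices, key=competitor_prices.get)
    return {
        "own_price": own_price,
        "competitor_average": round(average, 2),
        "difference_from_average": round(own_price - average, 2),
        "cheapest_competitor": cheapest,
        "cheaper_competitors": sorted(
            name for name, price in competitor_prices.items() if price < own_price
        ),
    }


# Hypothetical prices for one club and three courses the owner selected.
report = compare_course_price(
    own_price=60.0,
    competitor_prices={"Course A": 55.0, "Course B": 70.0, "Course C": 64.0},
)
print(report)
```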

    Price Comparison Smart View for Golf Clubs

    In both cases, our job was to gently pull complex information from the target sources, parse it according to the requirements, and save it to the database. The scraping frequency was set by each customer's requirements.

    If you need a similar solution or any sort of related help, please don’t hesitate to contact us.

    Best regards.

  • New Technologies Like PhantomJS Added to Our Work

    We are always experimenting with new technologies, and we adopt the ones that consistently prove efficient for our work. PhantomJS (along with a related module named “scraperjs”) was one such experiment for a long time, and we then decided to use it in projects that would benefit from it. In this post, I will explain how it has improved our web scraping: not in all cases, but in many.

    What is PhantomJS?
    PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG. PhantomJS is an optimal solution for Headless Website Testing, Screen Capture, Page Automation, Network Monitoring, etc.

    Of its features, we mainly used “Screen Capture” and “Page Automation” in our projects. The former helped us capture screenshots (thumbnails, etc.) as well as fully rendered HTML content. The latter helped us manipulate page content with the DOM API and jQuery. “Screen Capture” is really great because your script feels like it is navigating with a real browser. You can make your script wait only until the content you need has loaded.

    PhantomJS has many settings for customizing the crawling process. You can set the user agent, cookies, screen size for screenshots, and timeout duration, inject your own JavaScript, and much more.

    To take a quick look, check this page – http://phantomjs.org/page-automation.html

    PHP Integration with PhantomJS:
    We use a lot of PHP in our work, so we needed a way to use PhantomJS's capabilities from our PHP scripts. We struggled initially, but we succeeded, and it is now producing excellent results. PHP integration is useful in many ways, especially when we need to track progress in a database, save results to the database, or combine some results with others that are processed by other complex PHP methods.
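
    One common way to connect a scripting language to a command-line tool such as PhantomJS is to run it as a child process and read what it prints. The sketch below shows that general pattern in Python rather than PHP, purely as an illustration; it is not our integration code. The function `run_renderer` and the `render.js` script mentioned in the comments are hypothetical, and a stand-in command is used so the example runs without PhantomJS installed.

```python
import json
import subprocess
import sys


def run_renderer(command: list[str], timeout_seconds: float = 60.0) -> dict:
    """Run an external rendering script and parse the JSON it prints to stdout."""
    completed = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


# With PhantomJS installed, the command might look like:
#   ["phantomjs", "render.js", "https://example.com/scoreboard"]
# where render.js is a hypothetical script that prints its result as JSON.
# A stand-in command is used here so the sketch runs without PhantomJS.
stand_in = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"title": "demo", "rows": 3}))',
]
result = run_renderer(stand_in)
print(result["title"], result["rows"])
```

    Passing results back as JSON on standard output keeps the calling script free to store them in a database or merge them with data from other sources.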

    We mainly use PhantomJS when we need to scrape a page that depends so heavily on JavaScript and/or AJAX that the final result is not found in the page source. For example: a game scoreboard that changes every few seconds, or a very restricted website that keeps some parts in an iframe.

    That’s it for PhantomJS. We will share more in the coming days. Please feel free to contact us for any of your web scraping needs.

  • Alexa Top One Million Websites’ Details

    We have had a new scraper running for the last 6 months for a few of our clients (sorry, we often forget to share things here). It collects details of Alexa’s top one million websites and is designed for people who need to analyze websites from time to time. Our scraper collects fields such as categories, demographics, keywords, and owner contact information (name, email and phone), as well as general details like global rank and country rank. We currently scrape the data once a month to deliver an updated feed to our clients.

    Please take a look at the screenshots:

    Alexa Top 1 Million Websites

    Here is another one showing the other fields:

    Alexa Top Websites

    Please feel free to contact us for this service or any similar solution.

  • Experience with Web Data Platforms & Free Web Scraping Tools

    A lot is happening all the time, and many new platforms have appeared on the horizon. Some say they can do everything for you with a simple setup; others say they do the best work ever in the web scraping / extraction field and minimize your time and expense. But developer-based web scraping is still needed for much that those automated tools cannot handle.

    I’m talking about the tools that claim to be “good for everything” or “flexible for every scraping”. In reality, if you start on their platform by configuring a website with 300K pages, most will fail. More importantly, if your requirements include interactive pieces of information that come from AJAX or socket calls, and whose results are then mapped by JavaScript statements or functions, these so-called tools will fail. I must agree that they are good at parsing simple pages with plain HTML structures. They make life easier when it comes to crawling a list of simple pages or creating a sitemap. But we all know that today’s websites have become more and more interactive, binding multiple modern technologies together. The other issue is that these tools get blocked by websites very quickly. Every reason they get blocked for is manageable by a competent web scraping developer or team.

    One of our new clients recently sent me a message saying “I’ve attempted to download and use at least 12 different programs and non of them do what is expected. I appreciate your help.” We discussed his requirements and provided the solutions his business urgently needed, of course at a reasonable cost and within a reasonable time. We hear from such unhappy webmasters almost every week, and after some time we bring smiles to their faces.

    Before writing this post, our team members created accounts on some of those platforms and tested them in various ways. The experience was very time-consuming: we got wrong results, and we had no way to check how many of the results were wrong. So these tools are simply not suited to serious business and professional use. If you depend on your data, you need it to be accurate, and if you have somebody responsible for that, you can relax.

    Please feel free to communicate with us to discuss any web scraping project.

  • Our Horse Racing Scraper is Ready for your Subscription now

    We have been scraping horse racing results for quite a few years, but the service was not ready for subscription until now. We currently cover racing in the following countries:
    - USA
    - UK
    - Singapore
    - Hong Kong
    - South Africa
    - Malaysia
    - Zimbabwe
    - Mauritius

    Upon subscription, you get access to the following:
    - All past results (to date) from your subscribed countries
    - An “On Demand Scraper” that lets you instantly scrape the results for any date from your subscribed countries
    - The option to receive your results as CSV or by email
    - The option to host and analyze your results in the provided panel

    These are the current features, but we will be adding more over time based on your feedback and expectations.

    Take a look at our On Demand Scraping:

    Horse Racing Results Screenshot

    Horse Racing Results

    You can download & check a sample CSV result as well: Sample Racing Results

    Our subscription rates are very reasonable:
    Package # 1: For two or fewer countries -> $30/month
    Package # 2: For four countries -> $40/month
    Package # 3: For all countries -> $55/month

    If you report any downtime of our service to us, we will deduct a fair amount from your invoice.

    Please contact us for more details.

  • Custom Aggregation Services Using Drupal Feeds Module

    We have customized Drupal’s Feeds module to aggregate custom data from multiple sources, including websites, feeds, local or remote databases, and local files or media. We customized it heavily to add many capabilities. For example, our customization of the “Fetcher” makes it possible to download from complex sites that require a login to access data. We also added the ability to parse many difficult data types in the parser. And our customization of the “Processor” can insert or update multiple nodes/destinations and serve many purposes, such as updating ratings, referencing other nodes, posting to Twitter/Facebook, and more.

    Please feel free to contact us for any custom aggregation services for your Drupal site.