• Experience with Web Data Platforms & Free Web Scraping Tools


    A lot is happening all the time, and many new platforms keep appearing on the horizon. Some claim they can do everything for you with a simple setup; others say they are the best in the web scraping and extraction field and will minimize your time and expense. Even so, developer-driven web scraping remains far more capable than these automated tools.

    I’m talking about the tools that claim to be “good for everything” or “flexible for any scraping job”. In reality, if you configure a website with 300K pages on their platform, most of them will fail. More importantly, if your target data is loaded interactively through AJAX or socket calls, with the results mapped into the page by JavaScript functions, these so-called tools will fail as well. To be fair, they are good at parsing simple pages with plain HTML structures, and they make life easier when it comes to crawling a list of simple pages or creating a sitemap. But today’s websites are increasingly interactive, binding multiple modern technologies together. The other issue is that these tools get blocked by websites very quickly, whereas every reason for getting blocked is manageable for a competent web scraping developer or team.
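    The point about AJAX is that a developer can often skip the rendered page entirely and call the JSON endpoint the site’s JavaScript uses. A minimal Python sketch, where the endpoint URL, headers, and response shape are all hypothetical examples:

    ```python
    import json
    import urllib.request

    API_URL = "https://example.com/api/listings"  # hypothetical AJAX endpoint

    def build_request(page: int) -> urllib.request.Request:
        """Build a request to the JSON endpoint the site's JavaScript calls."""
        return urllib.request.Request(
            f"{API_URL}?page={page}",
            headers={
                "User-Agent": "Mozilla/5.0 (compatible; scraper/1.0)",
                # AJAX marker header that some backends check before serving JSON
                "X-Requested-With": "XMLHttpRequest",
            },
        )

    def extract_items(payload: str) -> list[dict]:
        """Pull the record list out of a JSON response body
        (the "items" key is an assumption about the response shape)."""
        return json.loads(payload).get("items", [])
    ```

    The actual fetch would be `urllib.request.urlopen(build_request(1))`; the data then arrives as structured JSON with no HTML parsing needed, which is exactly what page-oriented tools miss.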

    One of our new clients recently sent me a message saying, “I’ve attempted to download and use at least 12 different programs and none of them do what is expected. I appreciate your help.” We discussed his requirements and delivered the solutions his business urgently needed, of course at a reasonable cost and within a reasonable time. We hear from unhappy webmasters like this almost every week, and before long we put smiles back on their faces.

    Before writing this post, our team members created accounts on some of these platforms and tested them in various ways. The experience was an enormous waste of time: we got wrong results, and we had no way to check how many wrong results we had received. This is simply not suitable for serious business and professional use. If you need to depend on your data, you need it to be accurate, and if somebody is responsible for that, you can stay relaxed.

    Please feel free to contact us to discuss any web scraping project.

  • RSS Parser & Aggregator for Drupal & WordPress


    We developed a complex RSS parser and aggregator module for Drupal that can scrape given feeds and create nodes with proper versioning. It not only merges RSS feeds but also handles duplicate items according to the settings in the backend, from which the module is fully manageable. Later we developed a similar plugin for WordPress, where each item is added as a post. In both cases, you have the ability to add and manage custom feeds and their data. We also prepared and deployed a cron version for both.
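    The duplicate handling described above comes down to keying each feed item on a stable identifier and keeping only the first occurrence when feeds are merged. The module itself lives in Drupal/WordPress (PHP), but the idea can be sketched in Python against standard RSS 2.0 markup, using `<guid>` with `<link>` as a fallback key:

    ```python
    import xml.etree.ElementTree as ET

    def merge_feeds(feed_xml_docs: list[str]) -> list[dict]:
        """Merge several RSS 2.0 documents into one item list,
        dropping duplicates keyed on <guid> (falling back to <link>)."""
        seen: set[str] = set()
        merged: list[dict] = []
        for doc in feed_xml_docs:
            root = ET.fromstring(doc)
            for item in root.iter("item"):
                key = item.findtext("guid") or item.findtext("link")
                if key is None or key in seen:
                    continue  # duplicate or unidentifiable item: skip
                seen.add(key)
                merged.append({
                    "guid": key,
                    "title": item.findtext("title", ""),
                    "link": item.findtext("link", ""),
                })
        return merged
    ```

    In the real module each merged item then becomes a Drupal node or a WordPress post; the dedup policy (skip, update, or version the existing node) is what the backend settings control.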

    Besides the above, we worked for a UK-based property (MLS) website, built with Drupal, where a number of real estate agents feed data into the site in various formats, including RMv3, BLM files, and XML feeds. Our role was to build a module that processes the agent feeds (and some websites) and parses them into the Drupal system. We also managed their frontend to deploy and display the processed data properly.
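    Handling several feed formats in one module usually means detecting the format and routing each file to the right parser before normalizing everything into one record shape. A hypothetical Python sketch of such a dispatcher; the tag names are invented, and a generic delimited-text parser stands in for real RMv3/BLM handling, whose headers and field definitions are omitted here:

    ```python
    import xml.etree.ElementTree as ET
    from pathlib import Path
    from typing import Callable

    def parse_xml_feed(text: str) -> list[dict]:
        """Tiny XML property-feed parser; <property>/<ref>/<price> are
        hypothetical tag names, not the real RMv3 schema."""
        root = ET.fromstring(text)
        return [
            {"ref": p.findtext("ref", ""), "price": p.findtext("price", "")}
            for p in root.iter("property")
        ]

    def parse_delimited_feed(text: str, sep: str = "|") -> list[dict]:
        """Generic header-plus-rows delimited parser, standing in for
        BLM-style text feeds (real BLM definition blocks omitted)."""
        lines = [ln for ln in text.splitlines() if ln.strip()]
        header = lines[0].split(sep)
        return [dict(zip(header, row.split(sep))) for row in lines[1:]]

    PARSERS: dict[str, Callable[[str], list[dict]]] = {
        ".xml": parse_xml_feed,
        ".blm": parse_delimited_feed,
    }

    def process_feed(filename: str, text: str) -> list[dict]:
        """Route a feed file to the right parser by file extension."""
        ext = Path(filename).suffix.lower()
        if ext not in PARSERS:
            raise ValueError(f"unsupported feed format: {ext}")
        return PARSERS[ext](text)
    ```

    Once every format is normalized to the same record shape, the import into Drupal nodes is a single code path regardless of which agent sent the data.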