Project Added – ‘PV-Spiders’

I feel like I’m getting a bit nostalgic all of a sudden.. 🙂

PV-Spiders

This is another product I created for Prime Vendor in Wilmington, NC, to collect the bid information posted by government agencies across the United States.

The core problem here was that there are somewhere in the hundreds of thousands of organizations in the United States government that post bids for private contractors to, well, bid upon.  Many of these organizations post multiple new bids every day. Some of those are streamed out of a database but, surprisingly, lots of these government agencies still rely on human beings to update their tables of offers.  When your job is to create a network of ‘spiders’ that visit these pages, check whether anything has changed since their last visit and, if so, download the new data and submit it as a new bid.. well, there are often complications.

Everyone faced with this problem before me had written programs that issued HTTP requests to these sites and then tried to parse the HTML text that came back in some meaningful way, so they could either download the bid or continue on to the next stage of the website until they could, hopefully, eventually reach the direct bid file.  Many sites required you to log in and returned hundreds to thousands of results spread over several HTML pages – and many others were almost entirely JavaScript, which these spiders just couldn’t handle.

Being a little lazy, I decided to forgo all of this nastiness and instead chose to extend the built-in .NET WebBrowser control so that you could register parsing events for given URIs. Whenever the browser control hit one of these URIs, the event would be called and you’d have the entire page already parsed for you in the form of the DOM (Document Object Model).  You could then use LINQ (or old-school for-loops, should you prefer) to extract whatever information you needed from the page in a handful of lines of code.  You could even execute JavaScript.  This entire overhaul of the WebBrowser control took about 50 lines of code and less than an hour to develop.
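
To give a rough feel for the registration idea, here is a minimal sketch of the pattern. It is not the original code: it is written in Python rather than against the .NET WebBrowser control, it uses the standard library's HTML parser as a very crude stand-in for a real DOM, and it does not execute JavaScript. The names SpiderBrowser, register, document_completed, LinkCollector and extract_bid_files, as well as the example.gov address, are all invented for illustration.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Crude stand-in for a DOM: gathers (href, text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside an <a> tag that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


class SpiderBrowser:
    """Hypothetical browser wrapper that dispatches parsed pages by URI."""

    def __init__(self):
        self._handlers = []  # list of (compiled URI pattern, callback)

    def register(self, uri_pattern, callback):
        """Register a callback for every page whose URI matches the pattern."""
        self._handlers.append((re.compile(uri_pattern), callback))

    def document_completed(self, uri, html):
        """Called once a page has loaded; parses it and runs matching handlers."""
        dom = LinkCollector()
        dom.feed(html)
        dom.close()
        return [cb(uri, dom) for pattern, cb in self._handlers if pattern.search(uri)]


def extract_bid_files(uri, dom):
    """Example handler: keep only links that look like downloadable bid files."""
    return [href for href, _ in dom.links if href.lower().endswith(".pdf")]


browser = SpiderBrowser()
browser.register(r"^https://example\.gov/bids", extract_bid_files)

page = '<a href="/bids/1234.pdf">Road repair</a><a href="/about">About</a>'
results = browser.document_completed("https://example.gov/bids?page=1", page)
print(results)
```

The point of the design is that each site's spider shrinks to a handful of small handlers keyed by URI, while loading and parsing the page is done once, in one place.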
