There are five websites/search results pages we would like to scrape.
All five sites are either search engines or have search functionality, and that search functionality is what you would target for extraction.
We want to pass a query to each search engine, scrape the first page of results, and save the results as XML (one site will need more than one page). Further pages of results should be retrievable on user request. Our own search engine already returns some results, but we would like to supplement them with what you extract from the other sites. (Think of a filtered search engine.)
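To give a rough idea of the XML output step, here is a minimal sketch. It assumes a Python 3 interpreter is available on the servers, and the function name `results_to_xml`, the element names (`results`, `result`, `title`, `url`, `snippet`) and the attributes are placeholders for illustration only, not a fixed schema.

```python
import xml.etree.ElementTree as ET


def results_to_xml(query, source, page, results):
    """Serialize one page of scraped results to an XML string.

    `results` is a list of dicts with "title" and "url" keys and an
    optional "snippet" key (a hypothetical shape, chosen for this sketch).
    """
    root = ET.Element(
        "results",
        {"query": query, "source": source, "page": str(page)},
    )
    for rank, item in enumerate(results, start=1):
        node = ET.SubElement(root, "result", {"rank": str(rank)})
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "url").text = item["url"]
        ET.SubElement(node, "snippet").text = item.get("snippet", "")
    # ElementTree escapes &, < and > in text and attributes for us.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [{"title": "Example", "url": "http://example.com/?a=1&b=2"}]
    print(results_to_xml("test query", "site-1", 1, sample))
```

Carrying the page number in the output is one way to support the "more pages on request" requirement: each requested page becomes its own XML document tagged with its position.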
We have 15-20 IPs and expect you to rotate your requests across them so that none of the five sites receives too many queries from us in a given period of time. The servers your code will run on are CentOS 5 and are fully set up.
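One possible way to spread queries across the IPs is sketched below. It is only an illustration under assumptions: the class name `IpRotator`, the `min_interval` parameter and the round-robin policy are ours for this example, Python 3 is assumed, and actually binding an outgoing connection to the chosen IP is left out.

```python
import itertools
import time


class IpRotator:
    """Round-robin over source IPs with a minimum gap per (IP, site) pair."""

    def __init__(self, ips, min_interval, clock=time.monotonic):
        self._ips = list(ips)
        self._cycle = itertools.cycle(self._ips)
        self._min_interval = min_interval  # seconds between hits per IP/site
        self._clock = clock                # injectable so it can be tested
        self._next_free = {}               # (ip, site) -> time of last booking

    def acquire(self, site):
        """Return (ip, wait_seconds) for the next request to `site`."""
        now = self._clock()
        best_ip, best_wait = None, None
        # Look at each IP at most once, starting where the last call stopped.
        for _ in range(len(self._ips)):
            ip = next(self._cycle)
            last = self._next_free.get((ip, site))
            wait = 0.0 if last is None else max(0.0, last + self._min_interval - now)
            if best_wait is None or wait < best_wait:
                best_ip, best_wait = ip, wait
            if wait == 0.0:
                break  # this IP is free for the site right now
        # Book the slot at the moment the request is expected to be sent.
        self._next_free[(best_ip, site)] = now + best_wait
        return best_ip, best_wait


if __name__ == "__main__":
    rotator = IpRotator(["10.0.0.1", "10.0.0.2"], min_interval=5.0)
    for _ in range(3):
        ip, wait = rotator.acquire("site-1")
        time.sleep(wait)  # respect the per-site gap before sending
        print("would query site-1 from", ip)
```

The caller sleeps for the returned wait and then sends the request from the returned IP. In the standard library, `http.client.HTTPConnection` accepts a `source_address` argument that could be used for the binding, though whether that suits the final design is an open question.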
We expect that you will have extensive experience in data mining and extraction. I can give you the five targets in a PM if you ask. The scope of this project will not change. It is documented and your deliverables are clearly annotated.