This project is for a script (or other method) to scrape data from a public website.
DO NOT BID UNLESS YOU HAVE DONE THESE TYPES OF PROJECTS BEFORE!!!
The script ideally:
1. must work on Red Hat Linux via the command line, but can otherwise be written in the language of your choice. You must provide any package/installation requirements needed to run the script successfully.
2. must:
a) crawl the required pages
b) then parse and harvest the required data (I will provide the required data)
c) output the data into a comma-separated file
3. must use multi-threading to crawl the pages in parallel, with a configurable number of threads
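To illustrate the crawl, parse, and CSV-output steps with a configurable thread count, here is a minimal Python sketch using only the standard library. The function names (`crawl`, `write_csv`, `build_arg_parser`), the `--threads` and `--output` flags, and the `fetch`/`parse` callables are all assumptions made for illustration; the real fetching and parsing logic depends on the target site.

```python
import argparse
import csv
from concurrent.futures import ThreadPoolExecutor


def crawl(urls, fetch, parse, threads=4):
    """Fetch and parse every URL using a pool of `threads` worker threads.

    `fetch(url)` returns the page text and `parse(url, text)` returns a list
    of row dicts. Both are hypothetical, site-specific callables.
    """
    def work(url):
        return parse(url, fetch(url))

    rows = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map runs `work` in parallel but yields results in input order.
        for page_rows in pool.map(work, urls):
            rows.extend(page_rows)
    return rows


def write_csv(rows, out, fieldnames):
    """Write row dicts to the open text file `out` as comma-separated values."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


def build_arg_parser():
    """Command-line options; the flag names are assumptions, not requirements."""
    parser = argparse.ArgumentParser(description="Crawl pages and write a CSV file")
    parser.add_argument("--threads", type=int, default=4,
                        help="number of pages to crawl in parallel")
    parser.add_argument("--output", default="products.csv",
                        help="path of the CSV file to write")
    return parser
```

A caller would parse the options with `build_arg_parser().parse_args()`, pass `args.threads` to `crawl`, and open `args.output` with `newline=""` before handing it to `write_csv`.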
The crawler should be able to mask its identity to avoid being blocked.
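One common way to approach identity masking is to vary the User-Agent header per request and optionally route traffic through a proxy. The sketch below shows that idea with Python's `urllib.request`. The User-Agent strings, the function names `build_request` and `make_opener`, and the proxy address in the usage note are placeholders, not values taken from this project.

```python
import random
import urllib.request

# Placeholder User-Agent strings; a real crawler would use a larger, current list.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


def build_request(url, rng=random):
    """Return a Request for `url` carrying a randomly chosen User-Agent."""
    headers = {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)


def make_opener(proxy=None):
    """Return an opener that sends traffic through `proxy` when one is given."""
    handlers = []
    if proxy:
        # The same proxy URL is used for both http and https traffic.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    return urllib.request.build_opener(*handlers)
```

For example, `make_opener("http://127.0.0.1:8080").open(build_request(url))` would fetch `url` through a hypothetical local proxy with a rotated User-Agent.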
Required scraped data must be extracted from:
[url removed, login to view]
The following data needs to be scraped from the above website in an efficient way:
All product information (this data becomes visible once you enter a zip code (use 95051) -> Shop by Aisle):
* Aisle name (e.g. Baby)
* Sub-aisle category (e.g. Baby Accessories)
* Sub-sub-aisle category (e.g. Bottles & Nursing)
* Product Information
- Image (should be downloaded, in the larger size if available)
- Item description
- Product Details
- Directions (if available)
- Nutritional Facts (if available)
- Any remaining data (if available) should be categorized
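As a rough illustration of how the fields listed above could map onto one CSV row per product, here is a hedged Python sketch. The `Product` class, its field names, and the choice to store the remaining categorized data as a JSON string in a single `other` column are all assumptions; the final column layout is up to the project owner.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class Product:
    """One scraped product; field names are illustrative, not prescribed."""
    aisle: str
    sub_aisle: str
    sub_sub_aisle: str
    description: str
    details: str = ""
    directions: Optional[str] = None       # only if available
    nutrition_facts: Optional[str] = None  # only if available
    image_file: Optional[str] = None       # local path of the downloaded image
    other: Dict[str, str] = field(default_factory=dict)  # remaining data by category


def to_row(product):
    """Flatten a Product into a dict of strings suitable for csv.DictWriter."""
    row = asdict(product)
    # Missing optional fields become empty CSV cells.
    for key in ("directions", "nutrition_facts", "image_file"):
        row[key] = row[key] or ""
    # Remaining categorized data is kept in one column as a JSON object.
    row["other"] = json.dumps(row["other"], sort_keys=True)
    return row
```

Keeping the aisle, sub-aisle, and sub-sub-aisle on every row means the output file stays flat and can be filtered by category without a second lookup table.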