We need to extract data and images from several catalogs on the Internet. Each catalog contains hundreds of records of items with fields such as Title & Description, Price, and an Image. These must be tabulated and presented in a folder per catalog containing:
1) An Excel file with 3 columns: Title & Description, Price, and Image Name, and
2) A separate file containing the images, each named exactly as indicated in the corresponding Excel file under "Image Name"
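To make the expected output concrete, here is a minimal sketch of how one catalog folder might be assembled. It is only an illustration under stated assumptions: the `images` subfolder, the `image_name` naming scheme, and the record keys (`title`, `price`, `image`) are hypothetical, and the table is written as CSV with the standard library as a stand-in for the Excel file (a real delivery would need an Excel writer).

```python
import csv
import re
from pathlib import Path


def image_name(index: int, title: str, ext: str = ".jpg") -> str:
    """Build a file-system-safe image name (hypothetical naming scheme)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")[:40]
    return f"{index:04d}_{slug}{ext}"


def write_catalog(root: Path, catalog: str, records: list[dict]) -> Path:
    """Write one catalog folder: a 3-column table plus the image files.

    Assumes each record is a dict with 'title', 'price' and 'image',
    where 'image' holds the already-downloaded image bytes.
    """
    folder = root / catalog
    images = folder / "images"  # assumed location for the image files
    images.mkdir(parents=True, exist_ok=True)

    # CSV stand-in for the Excel file, using the three required columns.
    table = folder / f"{catalog}.csv"
    with table.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Title & Description", "Price", "Image Name"])
        for i, rec in enumerate(records, start=1):
            name = image_name(i, rec["title"])
            (images / name).write_bytes(rec["image"])
            # The name in the table is exactly the name of the saved file.
            writer.writerow([rec["title"], rec["price"], name])
    return table
```

The point of the sketch is that the image name is generated once and used both for the saved file and for the "Image Name" column, so the two cannot drift apart.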
An example or more details can be provided upon request, so if anything is unclear, please ask before bidding on this project.
**I have a total of about 80 - 100 separate URLs (each with 20 - 30 catalogs of 100 - 1200 records with images) to be extracted, so I will accept bids from multiple coders to minimize the time it will take to complete this project. Please bid for as many URLs as you wish and state your price for the job (*NOT* per URL).**
**IMPORTANT** - You must have experience with extraction patterns. Although most catalogs are fairly easy, some can be challenging and often require deep scanning to open images or to flip through pages. Writing the extraction pattern code should normally take 5-10 minutes. The actual data extraction takes several hours, but it is automatic and can be run on a separate computer or server.
Please ask any questions...
## Deliverables
So that you can estimate the work involved, we will provide the URLs where the catalogs are posted. Each URL contains at least 20 - 30 separate catalogs, and each catalog contains anywhere from 100 - 1200 records (a typical catalog is about 300 - 600 records long).
All catalogs under a given URL will have the same pattern, so you only need to determine the extraction pattern once for each URL and then apply it to each catalog under that URL. However, because this extraction also involves images, please note that it can be time-consuming, and ***it is advised that you have a fairly fast Internet connection and a separate computer on which to run the extractions so that they do not interfere with any other work you may have.***
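The "one pattern per URL" idea can be sketched as follows. This is a hedged illustration only: the HTML markup, the `RECORD_PATTERN` regular expression, and the function names are all hypothetical, and each real site will need its own pattern (and possibly a proper HTML parser rather than a regular expression).

```python
import re

# Hypothetical record markup; a real site will need its own pattern.
RECORD_PATTERN = re.compile(
    r'<div class="item">\s*'
    r'<h3>(?P<title>.*?)</h3>\s*'
    r'<span class="price">(?P<price>.*?)</span>\s*'
    r'(?:<img src="(?P<image>[^"]+)"\s*/?>\s*)?'  # the image is optional
    r'</div>',
    re.DOTALL,
)


def extract_records(html: str, pattern: re.Pattern = RECORD_PATTERN) -> list[dict]:
    """Return one dict per record found in a single catalog page."""
    return [m.groupdict() for m in pattern.finditer(html)]


def extract_site(catalog_pages: dict[str, str]) -> dict[str, list[dict]]:
    """Apply the same pattern to every catalog posted under one URL.

    `catalog_pages` maps a catalog name to its already-downloaded HTML.
    """
    return {name: extract_records(html) for name, html in catalog_pages.items()}
```

Because the pattern is defined once and passed to every catalog, adding another catalog under the same URL requires no new extraction code, only another entry in `catalog_pages`.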
Some filters will be applied, which will shorten the extraction. For example, only items that have images should be extracted; records with no images can be skipped. Also, only items with a price greater than 0 (i.e., items that actually sold) should be extracted.
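The two example filters above could be expressed roughly like this. It is a minimal sketch, not a required implementation: the record keys (`image`, `price`), the `$` currency symbol, and the choice to treat unparseable prices as 0 are assumptions made for illustration.

```python
from decimal import Decimal, InvalidOperation


def parse_price(raw) -> Decimal:
    """Parse a price such as '$1,250.00'; unparseable values count as 0."""
    text = str(raw or "").replace("$", "").replace(",", "").strip()
    try:
        value = Decimal(text)
    except InvalidOperation:
        return Decimal(0)
    # Guard against inputs like 'NaN' or 'Infinity', which Decimal accepts.
    return value if value.is_finite() else Decimal(0)


def keep_record(record: dict) -> bool:
    """Keep only records that have an image and a price greater than 0."""
    return bool(record.get("image")) and parse_price(record.get("price")) > 0
```

Applying `keep_record` before downloading any image means that skipped records cost no download time, which is where the filters shorten the extraction.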