I need someone who can write a program, script, crawler, etc. that can crawl through an existing website and extract a large amount of data from that site (not the HTML code itself). The data I need to collect is very organized on the website, so extracting it should not be that hard to accomplish.
The finished product should produce at least 700 Excel files, organized in folders on a server that I will specify later. Once the master list is organized, I will need to update the data inside the Excel files whenever the information on the website changes. I will also need new Excel files added when a scan for updates finds an entry that does not yet have one. Here is how it needs to go (of course this can change if you think you know a more efficient way of doing it):
1. Go to the website.
2. Choose the master list. This will show you all of the final Excel file names.
3. Each Excel file will have 2 columns only. As stated, the final result should produce over 700 Excel files. Some will contain a massive amount of data; some will contain very little. The following is the final result for an Excel file:

Column A            | Column B
EXAM Example        | EXAM 1000 Example Data
EXAM Example        | EXAM 1001 Example Data
EXAM Example        | EXAM 1002 Example Data
EXAM Example        | EXAM 1003 Example Data
EXAM Example        | EXAM 1004 Example Data
DEMO Demonstration  | DEMO 1000 Demonstration Data
DEMO Demonstration  | DEMO 1001 Demonstration Data
DEMO Demonstration  | DEMO 1002 Demonstration Data
DEMO Demonstration  | DEMO 1003 Demonstration Data
DEMO Demonstration  | DEMO 1004 Demonstration Data
4. Once you choose an entry from the master list, you will get another list in alphabetical order. The program will then have to choose a link at the top of the navigation. Choosing this link will take you to a different URL on the same domain. This will give you all of the data for Column A, as a 2-column table in alphabetical order. You will then have to click on the first option in the field, which will bring up all of the Column B data associated with that Column A entry. Once that information is recorded, you go back to the next entry in Column A and get its associated data. This loop continues until you reach the end of Column A. Once you reach the end and all of the data has been extracted, you will have one complete Excel file.
5. You will then have to go back to the master list, select the second choice, and loop through steps 3 and 4 again. This will continue until you have looped through the whole master list.
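The loop in steps 1-5 can be sketched as follows. Since the website is not named here, `fetch_master`, `fetch_column_a`, and `fetch_column_b` are hypothetical stand-ins for the real download-and-parse code; they read from a canned dictionary so the structure of the loop itself can be shown:

```python
# Sketch of steps 1-5 as nested loops. The three fetch_* helpers are
# placeholders for real page downloads and HTML parsing; "site" stands
# in for the website's contents.

def fetch_master(site):
    # Step 2: the master list, one entry per final Excel file.
    return sorted(site["master"])

def fetch_column_a(site, name):
    # Step 4: the alphabetical Column A table for one master entry.
    return site["pages"][name]

def fetch_column_b(site, name, entry):
    # Step 4: the Column B data linked from one Column A entry.
    return site["details"][(name, entry)]

def crawl(site):
    # Returns {excel_file_name: [(column_a, column_b), ...]}.
    files = {}
    for name in fetch_master(site):                # step 5: every master entry
        rows = []
        for a in fetch_column_a(site, name):       # walk Column A top to bottom
            for b in fetch_column_b(site, name, a):
                rows.append((a, b))                # one spreadsheet row
        files[name] = rows                         # data for one Excel file
    return files
```

The `files` dict maps each master-list name to the rows of its spreadsheet; writing each entry out as a real .xlsx file (for example with a library such as openpyxl) is then a separate, mechanical step.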
As I said before, this is what needs to happen; I really don't care how it is programmed. The final product should produce the Excel files or update them. If, when updating, you find that an Excel file is no longer listed in the master list, the program should keep the previous Excel file and not delete it. There should also be a text file created that summarizes what was completed: whether an Excel file was updated or a new one was added. We need to know whenever an Excel file is added, updated, or dropped from the master list. This is very important, considering that this list will have at least 700 separate files.
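The update rules above (never delete a file that has dropped off the master list; log every addition and update) reduce to simple set comparisons. A minimal sketch, assuming the program knows the current master-list names and the names of the Excel files already on disk:

```python
# Classify each Excel file for one update run. Files missing from the
# master list are kept, never deleted, per the requirement above.

def classify(master_names, existing_names):
    master, existing = set(master_names), set(existing_names)
    return {
        "added": sorted(master - existing),    # on the list, no file yet
        "updated": sorted(master & existing),  # refresh from the website
        "kept": sorted(existing - master),     # off the list; leave file as-is
    }

def summary_lines(result):
    # Lines for the summary text file written after each run.
    return [f"{action}: {name}.xlsx"
            for action in ("added", "updated", "kept")
            for name in result[action]]
```

After a run, `summary_lines(classify(...))` can be joined with newlines and written out as the summary text file.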
A final example Excel file will be provided. We need this to be done by some type of program: doing it by hand normally takes about 3-4 hours per Excel file, and besides being unproductive, it is very boring. We would like to get this done by the end of May. Please let us know if you have any questions.