I need a large amount of data ripped/mined from two different websites for archival purposes. One website has 15,000 pages, and the other has 11,000 pages.
After you rip the individual pages, I need you to write a script to extract/parse the data from the HTML files and place it in a delimited text file. You will also need to strip any strange characters from the data so that the delimited file can be imported cleanly without errors.
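To make the parsing step concrete, here is a minimal sketch of the kind of script described above. The field names and HTML patterns are placeholders, since the actual pages are unknown; the character cleanup normalizes accented and "strange" characters to plain ASCII and collapses tabs/newlines so a tab-delimited import won't break.

```python
import csv
import re
import unicodedata
from pathlib import Path

# Hypothetical field patterns -- these would be replaced with
# patterns matching the real page structure.
FIELD_PATTERNS = {
    "title": re.compile(r'<h1[^>]*>(.*?)</h1>', re.S),
    "date": re.compile(r'<span class="date">(.*?)</span>', re.S),
}

def clean_field(text):
    """Strip tags, normalize odd characters to plain ASCII, and
    collapse whitespace so the delimited file imports cleanly."""
    text = re.sub(r'<[^>]+>', ' ', text)            # drop nested tags
    text = unicodedata.normalize('NFKD', text)      # decompose accents
    text = text.encode('ascii', 'ignore').decode()  # drop non-ASCII
    return re.sub(r'\s+', ' ', text).strip()        # collapse whitespace

def parse_page(html):
    """Return one record per page; missing fields are left empty,
    matching the 'collect what you can' requirement."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.search(html)
        record[name] = clean_field(m.group(1)) if m else ''
    return record

def export(html_dir, out_path):
    """Parse every ripped page and write a tab-delimited file."""
    with open(out_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(FIELD_PATTERNS),
                                delimiter='\t')
        writer.writeheader()
        for page in sorted(Path(html_dir).glob('*.html')):
            writer.writerow(parse_page(page.read_text(errors='ignore')))
```

Tab-delimited output is assumed here; switching the `delimiter` argument gives CSV or any other separator the import tool expects.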
Each page will need the following fields extracted… if you can't extract every field from every record, just collect what you can. I'm fine with a few incomplete records.
Now for the hard part… Unfortunately, one of these websites has many protections in place against data miners. They use some kind of firewall-based connection limiting that allows only a few connections from the same IP within a short window before you are filtered. In addition, they block known proxies. This is a difficult job unless you are very tricky…
I was going to use a script similar to this one to gather the data, working through a list of good proxies:
[url removed, login to view]
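Since the linked script isn't visible, here is a rough sketch of what a throttled, proxy-rotating fetch loop for this job might look like. The proxy addresses and the two-second delay are placeholder assumptions; the delay would need tuning against the real site's rate-limit threshold.

```python
import itertools
import random
import time
import urllib.request

class ProxyRotator:
    """Cycle through a proxy list so no single IP exceeds the
    site's connection limit; order is shuffled once at the start."""
    def __init__(self, proxies):
        shuffled = random.sample(proxies, len(proxies))
        self._cycle = itertools.cycle(shuffled)
        self.pool_size = len(proxies)

    def next_proxy(self):
        return next(self._cycle)

def fetch(url, rotator, delay=2.0):
    """Fetch one page through the next proxy in the rotation,
    pausing between requests to stay under the rate limit.
    The delay value is a guess, not a measured threshold."""
    proxy = rotator.next_proxy()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': proxy, 'https': proxy}))
    time.sleep(delay)
    with opener.open(url, timeout=30) as resp:
        return resp.read()
```

Spreading ~15,000 requests across a pool of proxies with a per-request delay is what stretches this job into days or weeks: with a 2-second delay and 20 proxies, each IP only sees a request every 40 seconds or so.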
I'm just too busy and don't have the patience to finish this job myself. The data collection could take days, if not weeks, depending on how quickly you are able to gather data from the website that makes it difficult.
I don't believe the other website has the same data protections, but I have not tested in quite some time.
Private message me for the names of the sites I'm trying to data mine. Also, feel free to message me with any other questions. I did not set a budget because I'm not sure how difficult this job will end up being. Don't worry, I'm able to pay a fair amount for the work. Bid accordingly.