333809 URL harvesting script
Paid on delivery
I need a script or desktop application (for Windows Vista) to harvest website addresses (URLs) for me. My preference is a script that runs on PHP and MySQL on a Linux server.
I want to enter a list of keyword phrases such as "cheap hosting" and "custom furniture". Typically there will be a few hundred of these at a time, and I need to be able to add and delete phrases.
When I run the script (let's call that a scan), it must gather website addresses from the following sources for me (using "cheap hosting" as an example) -
1) [url removed, login to view] - all the results (not just the first page of results)
2) The first 100 results from [url removed, login to view] - filtered to show only the URLs that actually have one or more of the keywords in the domain name itself, but not as part of a subdomain. (So [url removed, login to view] and [url removed, login to view] are OK, but [url removed, login to view] and [url removed, login to view] are not.)
3) The first 100 results from Google for [url removed, login to view]
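The subdomain filter in requirement 2 could be sketched roughly as below. This is only an illustration of the intended logic: the function name `keywordInDomain` is hypothetical, and a production version would need a public-suffix list to handle multi-part TLDs such as .co.uk, which this naive "last two labels" approach gets wrong.

```php
<?php
// Hypothetical sketch: accept a URL only if one of the keyword
// phrase's words appears in the registrable domain itself, not
// merely in a subdomain.
function keywordInDomain(string $url, string $phrase): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // no parsable host component
    }
    // Naively treat the second-to-last label as the domain name,
    // e.g. "shop.cheaphosting.com" -> "cheaphosting".
    $labels = explode('.', strtolower($host));
    $domainLabel = str_replace('-', '', $labels[count($labels) - 2] ?? '');
    // Match if ANY single word of the phrase occurs in that label.
    foreach (preg_split('/\s+/', strtolower($phrase)) as $word) {
        if ($word !== '' && strpos($domainLabel, $word) !== false) {
            return true;
        }
    }
    return false;
}
```

Hyphens are stripped from the domain label so that a name like cheap-hosting.com still matches the word "cheap".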
These results must go into a database in the form [url removed, login to view] (NOT [url removed, login to view]), separated into the three categories above, along with the date the script was run.
There will be a function where I can enter, from time to time (for a specific keyword phrase), multiple URLs (anything from 10 to 1000 at a time) in the form website.com. If these URLs match existing URLs for that keyword phrase, they must be marked as "used". I also need to be able to mark/reset URLs to unused/default.
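The batch "mark as used" step might look like the sketch below, assuming a hypothetical `results` table with `keyword`, `url`, and `used` columns; the actual schema is up to the implementer.

```php
<?php
// Hypothetical sketch: mark a batch of pasted URLs as used (or reset
// them to unused) for one keyword phrase, via a prepared statement.
function markUsed(PDO $db, string $phrase, array $urls, bool $used = true): int
{
    $stmt = $db->prepare(
        'UPDATE results SET used = :used WHERE keyword = :kw AND url = :url'
    );
    $count = 0;
    foreach ($urls as $url) {
        $stmt->execute([
            ':used' => $used ? 1 : 0,
            ':kw'   => $phrase,
            ':url'  => strtolower(trim($url)), // normalise pasted input
        ]);
        $count += $stmt->rowCount(); // rows actually updated
    }
    return $count;
}
```

Returning the number of updated rows lets the UI show how many of the pasted URLs actually matched existing results.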
Then I need a function to generate a report of all the unique results gathered for a specific keyword between date X and date Y (I will enter these values) that are not marked as used. The report must be in CSV format with these fields -
keyword phrase
url
source
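The CSV report described above could be produced along these lines. Again a sketch only: the `results` table and its `scanned_on` date column are assumed names, and `$out` is any writable stream (a file handle or `php://output`).

```php
<?php
// Hypothetical sketch: write the unique, unused results for one
// keyword phrase within a date range to CSV (keyword, url, source).
function exportReport(PDO $db, string $phrase, string $from, string $to, $out): void
{
    fputcsv($out, ['keyword phrase', 'url', 'source']); // header row
    $stmt = $db->prepare(
        'SELECT DISTINCT keyword, url, source FROM results
          WHERE keyword = :kw AND used = 0
            AND scanned_on BETWEEN :from AND :to'
    );
    $stmt->execute([':kw' => $phrase, ':from' => $from, ':to' => $to]);
    while ($row = $stmt->fetch(PDO::FETCH_NUM)) {
        fputcsv($out, $row);
    }
}
```

`SELECT DISTINCT` handles the "unique results" requirement, and `used = 0` excludes anything already marked as used.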
That is the basic functionality of the script. Other features will include -
1) There will be a general filter, applicable to all keyword phrases, where I can enter URLs that I do not want collected.
2) I want to run the scans in bulk by selecting which keyword phrases to scan. At the end of the scan, the script must report the number of NEW results per keyword phrase - results collected during that scan that have not been collected before.
3) The script must have a setting to specify a random delay in seconds between searches, to avoid being blocked by Google.
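The random-delay setting from feature 3 is a one-liner in practice; a minimal sketch, assuming the min/max bounds come from the script's settings:

```php
<?php
// Hypothetical sketch: pause a random number of seconds between two
// searches. $minDelay and $maxDelay would come from the settings page.
function politePause(int $minDelay, int $maxDelay): int
{
    $secs = random_int($minDelay, $maxDelay); // cryptographically random
    sleep($secs);
    return $secs; // returned so the scan log can record the delay used
}
```

Calling this between consecutive queries spaces the requests unevenly, which looks less bot-like than a fixed interval.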
Project ID: #2079619