A PHP script reads the domains to be crawled from the database table t_domain. The script must honor each domain's robots.txt. Starting from the entry point, it should recursively collect all links (HTML a elements) found on the domain, subject to these rules:

- Crawl to a maximum depth of 5 from the entry point.
- Follow local links only.
- Follow only links whose Content-Type header is text/html (checked via a header request).
- Follow at most 100 links per page.
- Do not wait longer than 10 seconds for a page to load.

Every link found (whether local or pointing to a different domain) is stored in the table t_links. The following should be stored: the timestamp of the crawl, the full URL, the t_domain ID of the domain the link was found on, and the t_domain ID of the domain the link points to. If the destination domain does not exist in t_domain yet, it must be added.
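As a rough illustration of the per-page constraints above, here is a minimal sketch in PHP. The function names, the omission of robots.txt handling (which a framework would provide), and the skipped relative-URL resolution are all assumptions for illustration, not part of the brief.

```php
<?php
// Sketch of the per-page crawl rules. Helper names and wiring are
// illustrative assumptions; robots.txt handling and relative-URL
// resolution are left to whichever framework is chosen.

const MAX_DEPTH = 5;            // recursion depth from the entry point
const MAX_LINKS_PER_PAGE = 100; // at most 100 links followed per page
const PAGE_TIMEOUT = 10;        // seconds to wait for a page

// Check the Content-Type header (HEAD request) before fetching the body.
function isHtml(string $url): bool
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // header check only
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => PAGE_TIMEOUT,
        CURLOPT_RETURNTRANSFER => true,
    ]);
    curl_exec($ch);
    $type = (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);
    return stripos($type, 'text/html') === 0;
}

// Extract at most MAX_LINKS_PER_PAGE href values from <a> elements.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings on messy real-world markup
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '' || $href[0] === '#') {
            continue; // skip empty and same-page fragment links
        }
        $links[] = $href;
        if (count($links) >= MAX_LINKS_PER_PAGE) {
            break;
        }
    }
    return $links;
}

// A link counts as "local" when its host matches the crawled domain;
// relative URLs (no host component) are treated as local.
function isLocal(string $url, string $crawlHost): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    return $host === null || $host === $crawlHost;
}
```

Only local links would then be recursed into (up to depth 5), while every extracted link, local or not, is written to t_links.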
Once a domain has been completely crawled, a timestamp is added to its row in t_domain.
Then the next domain to crawl is selected from t_domain. The next domain is defined as the one having no timestamp and the lowest id.
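The bookkeeping for these two steps could look roughly like the following PDO sketch. The column name crawled_at and the connection details are assumptions; the brief only specifies that a timestamp is stored and that selection goes by lowest id without a timestamp.

```php
<?php
// Sketch of the crawl-loop bookkeeping, assuming a nullable
// `crawled_at` timestamp column on t_domain (column name is a guess).
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

// 1) Mark the domain that just finished as completely crawled.
$stmt = $pdo->prepare('UPDATE t_domain SET crawled_at = NOW() WHERE id = :id');
$stmt->execute(['id' => $currentDomainId]);

// 2) Pick the next domain: no timestamp yet, lowest id first.
$next = $pdo->query(
    'SELECT id, url FROM t_domain WHERE crawled_at IS NULL ORDER BY id ASC LIMIT 1'
)->fetch(PDO::FETCH_ASSOC);
// $next === false once every domain has been crawled.
```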
This does not have to be built completely from scratch. We recommend using an existing framework such as [login to view URL] or [login to view URL], or another project of your choosing. The important part for us is collecting the links and the domains.
We will provide a server with PHP installed and a database, preferably MySQL. This server can be used for testing.