1) Robots get a list of URLs -- called IndexURLs later on. This list of IndexURLs is modified regularly, so the Robots need to re-evaluate it whenever it changes.
2) Based on each IndexURL a new set of URLs is generated -- called DetailURLs later on. A DetailURL is the page from which the relevant information for a new Event, Location or Performer is to be extracted.
1) The pattern for locating the DetailURLs needs to be configurable, e.g. in the list of IndexURLs
2) The list (could be a queue) of DetailURLs to be processed is to be updated daily.
3) Already processed DetailURLs must be skipped and not added to the list again.
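
As a rough illustration of points 2.1 to 2.3, the following Java sketch collects DetailURLs from an index page with a configurable regular expression and skips URLs that were seen before. The class name DetailUrlCollector, the use of a regex with one capture group, and the in-memory set and queue are assumptions made for this example; a real Robot would persist both.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects new DetailURLs found on one index page. */
final class DetailUrlCollector {
    // DetailURLs queued or processed earlier (in memory here; persisted in a real Robot).
    private final Set<String> seen = new HashSet<>();
    // DetailURLs waiting to be processed.
    private final Queue<String> queue = new ArrayDeque<>();

    /**
     * Scans the HTML of an index page with the pattern configured for its IndexURL.
     * Assumption: group 1 of the pattern captures the DetailURL.
     * Returns the number of newly queued URLs.
     */
    int collect(String indexPageHtml, Pattern detailUrlPattern) {
        int added = 0;
        Matcher m = detailUrlPattern.matcher(indexPageHtml);
        while (m.find()) {
            String url = m.group(1);
            // Set.add returns false for URLs seen before, so those are skipped.
            if (seen.add(url)) {
                queue.add(url);
                added++;
            }
        }
        return added;
    }

    /** Next DetailURL to process, or null if the queue is empty. */
    String next() {
        return queue.poll();
    }
}
```

Running collect once a day against each IndexURL would cover the daily update, since only URLs not yet in the seen set are queued.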
3) Processing of each DetailURL
1) from a new DetailURL the contents are extracted
1) The Robots must be able to determine the language, the type of item (Event, Location, Performer) and whether the contents are copyright-safe (copyrighted contents are added but flagged), based on clear criteria that can easily be handled automatically.
2) The API should be used in order to check if the item is already present in the system (this case needs to be specified in more detail!)
3) The API should be used in order to add the new item
4) Add the processed DetailURL to the Scheduler for Processing in order to check the event itself, the location and performer 1 week before, 3 days before and on the same day of the event.
5) save a hash of the contents for later comparison
6) templates/patterns for locations, events and performers must be configurable for each IndexURL
2) from an old DetailURL (=scheduled) the contents are extracted
1) check with the previously stored hash value
2) if different update the item, else finished
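
The hash comparison and the recheck schedule from point 3 could be sketched as follows. This is only an illustration: the class name RecheckSupport, the choice of SHA-256, and the reading of "1 week, 3 days and on same day" as days before the event are assumptions, not part of the specification.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helpers for change detection and recheck scheduling. */
final class RecheckSupport {
    private RecheckSupport() {}

    /** Hex-encoded SHA-256 of the extracted contents (the algorithm is an assumption). */
    static String contentHash(String contents) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(contents.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** True if the contents differ from the stored hash, i.e. the item must be updated. */
    static boolean hasChanged(String storedHash, String currentContents) {
        return !contentHash(currentContents).equals(storedHash);
    }

    /** Recheck dates 7 and 3 days before the event and on the event day; past dates are dropped. */
    static List<LocalDate> recheckDates(LocalDate eventDate, LocalDate today) {
        List<LocalDate> dates = new ArrayList<>();
        for (int daysBefore : new int[] {7, 3, 0}) {
            LocalDate d = eventDate.minusDays(daysBefore);
            if (!d.isBefore(today)) {
                dates.add(d);
            }
        }
        return dates;
    }
}
```

Hashing the extracted contents rather than the raw page is a design choice: it avoids spurious updates when only surrounding markup, such as ads or navigation, changes.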
1) the current status must be visualized (which IndexURL/DetailURL is processed, errors that occurred during extraction or API invocation etc.)
2) API calls must be logged and visualized
3) Processing the IndexURLs and DetailURLs must be logged and visualized
4) List of IndexURLs and DetailURLs must be visualized and searchable
5) Editors must be able to change contents and states for error cases and to manually start the remaining process on the modified contents.
1) Clean old events from the Robots (Hash-Code but not the log entries) once a month for events finished 30 days ago.
2) start/stop scripts
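
The monthly cleanup of stored hashes could look roughly like the sketch below. The names HashStoreCleaner and StoredHash, the in-memory map keyed by DetailURL, and the reading of "finished 30 days ago" as "finished at least 30 days ago" are assumptions for this example. Log entries are not touched because they are assumed to live in a separate store.

```java
import java.time.LocalDate;
import java.util.Map;

/** Hypothetical cleanup job for stored content hashes. */
final class HashStoreCleaner {
    private HashStoreCleaner() {}

    /** Hypothetical record: a stored hash together with the end date of its event. */
    static final class StoredHash {
        final String hash;
        final LocalDate eventEnd;

        StoredHash(String hash, LocalDate eventEnd) {
            this.hash = hash;
            this.eventEnd = eventEnd;
        }
    }

    /**
     * Removes hashes of events that ended at least 30 days before today.
     * Returns the number of removed entries.
     */
    static int clean(Map<String, StoredHash> hashesByDetailUrl, LocalDate today) {
        LocalDate cutoff = today.minusDays(30);
        int before = hashesByDetailUrl.size();
        // Keep only entries whose event ended after the cutoff date.
        hashesByDetailUrl.values().removeIf(h -> !h.eventEnd.isAfter(cutoff));
        return before - hashesByDetailUrl.size();
    }
}
```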
1) Robots must allow to be clustered
7) Transactional Robots
1) when Robots are killed or otherwise stopped (e.g. power outage), tasks that were not yet completed must be redone.
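
One possible way to meet this requirement is a task journal that records the state of each DetailURL and resets interrupted tasks on startup. The sketch below is an in-memory illustration only; the class TaskJournal and its states are hypothetical, and a real Robot would need to store the journal durably (for example in BDB or an RDB) for it to survive a power outage.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical journal that tracks the processing state of each DetailURL. */
final class TaskJournal {
    enum State { PENDING, IN_PROGRESS, DONE }

    // In-memory stand-in for a durable, transactional store.
    private final Map<String, State> states = new LinkedHashMap<>();

    /** Registers a task unless it is already known. */
    void enqueue(String detailUrl) {
        states.putIfAbsent(detailUrl, State.PENDING);
    }

    /** Marks a task as started; it stays in this state if the Robot dies. */
    void start(String detailUrl) {
        states.put(detailUrl, State.IN_PROGRESS);
    }

    /** Marks a task as finished. */
    void complete(String detailUrl) {
        states.put(detailUrl, State.DONE);
    }

    /** Called on startup: interrupted tasks are reset to PENDING and returned for redo. */
    List<String> recover() {
        List<String> redo = new ArrayList<>();
        for (Map.Entry<String, State> e : states.entrySet()) {
            if (e.getValue() == State.IN_PROGRESS) {
                e.setValue(State.PENDING);
                redo.add(e.getKey());
            }
        }
        return redo;
    }

    State stateOf(String detailUrl) {
        return states.get(detailUrl);
    }
}
```

Because an interrupted task is simply run again, the processing steps would need to be idempotent, which fits the requirement to check via the API whether an item is already present before adding it.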
8) uses Java and BDB or an RDB
You can check the site [url removed, login to view]. This site is owned and developed by TaggSoft solutions (i.e. us) and it has its own web crawler.