Create a web spider as a Linux daemon

A Linux daemon should spider intranet websites and extract some data.

The base URLs of the intranet servers are given as ([url removed, login to view], [url removed, login to view] ... [url removed, login to view]).

A C++ application (daemon) should be built with the following interface, which allows a list of pages (URLs) to be created and managed:

- add a host to be spidered (crawl all pages on the site and build the site's list of pages)

- add a single URL to be spidered (add it to the site's list of pages)

- remove a host (it will no longer be spidered; delete all related Xapian data and its list of pages)

- remove a single URL (delete all related Xapian data and remove it from the list of pages to be spidered)

- set a list of URL parameters that should be ignored (session IDs, for example)

- specify a time interval after which an already spidered URL has to be spidered again

- specify a minimum time interval between consecutive requests to a site IP, to avoid overloading it

- specify a max_depth parameter that defines how deep the site should be crawled

- each site host should be handled by its own process, e.g. 10 site IPs to spider -> 10 processes
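The management interface listed above could be sketched roughly as follows. This is only an illustration under assumptions: the names SpiderSettings, SpiderRegistry and strip_ignored_params, as well as the default values, are hypothetical and not part of the brief. Deleting the related Xapian data and starting one process per host are left out.

```cpp
#include <chrono>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical settings; names and defaults are illustrative only.
struct SpiderSettings {
    std::set<std::string> ignored_params;           // e.g. session ID parameter names
    std::chrono::hours respider_interval{24 * 7};   // re-spider a URL after this long
    std::chrono::milliseconds request_delay{1000};  // pause between requests to one site IP
    int max_depth = 5;                              // how deep links are followed
};

// Keeps the list of pages per host. A real daemon would also delete the
// related Xapian data in remove_host and remove_url (omitted here).
class SpiderRegistry {
public:
    void add_host(const std::string& host) { pages_[host]; }
    void add_url(const std::string& host, const std::string& url) {
        pages_[host].insert(url);
    }
    void remove_host(const std::string& host) { pages_.erase(host); }
    void remove_url(const std::string& host, const std::string& url) {
        auto it = pages_.find(host);
        if (it != pages_.end()) it->second.erase(url);
    }
    bool has_host(const std::string& host) const { return pages_.count(host) > 0; }
    std::size_t page_count(const std::string& host) const {
        auto it = pages_.find(host);
        return it == pages_.end() ? 0 : it->second.size();
    }

    SpiderSettings settings;

private:
    std::map<std::string, std::set<std::string>> pages_;
};

// Removes ignored query parameters from a URL so that two URLs differing
// only in, say, a session ID map to the same page. Fragments ("#...") are
// not handled in this sketch.
std::string strip_ignored_params(const std::string& url,
                                 const std::set<std::string>& ignored) {
    const auto qpos = url.find('?');
    if (qpos == std::string::npos) return url;

    std::string result = url.substr(0, qpos);
    std::istringstream query(url.substr(qpos + 1));
    std::string pair;
    char sep = '?';
    while (std::getline(query, pair, '&')) {
        if (pair.empty()) continue;
        const std::string name = pair.substr(0, pair.find('='));
        if (ignored.count(name) > 0) continue;  // drop ignored parameter
        result += sep;
        result += pair;
        sep = '&';
    }
    return result;
}
```

With "sid" in the ignored list, a URL ending in "?sid=1&x=2" would be reduced to one ending in "?x=2" before it is added to the list of pages.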

The interface should allow definitions such as: spider all URLs from [url removed, login to view], all from [url removed, login to view] except [url removed, login to view], plus spider only [url removed, login to view].
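One possible way to express such include/exclude definitions is a list of prefix rules where the longest matching prefix decides. This is a sketch under assumptions: ScopeRule and in_scope are hypothetical names, and prefix matching is an interpretation of the requirement (a rule for a single page would also match that page with extra query parameters).

```cpp
#include <string>
#include <vector>

// Hypothetical rule type: a URL prefix plus whether it is to be spidered.
struct ScopeRule {
    std::string prefix;  // URL prefix the rule applies to
    bool allow;          // true = spider, false = skip
};

// Returns true if the URL is in scope. The longest matching prefix wins,
// so a narrower "except" rule overrides a broader "all" rule. URLs that
// match no rule are not spidered.
bool in_scope(const std::vector<ScopeRule>& rules, const std::string& url) {
    const ScopeRule* best = nullptr;
    for (const auto& rule : rules) {
        if (url.compare(0, rule.prefix.size(), rule.prefix) != 0) continue;
        if (best == nullptr || rule.prefix.size() > best->prefix.size()) {
            best = &rule;
        }
    }
    return best != nullptr && best->allow;
}
```

The example from the text would then be four rules: two allowing whole hosts, one denying a sub-path of the second host, and one allowing a single page on a third host.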

The processes that spider through the list of pages should:

- fetch the content of each URL and split it into text (content without HTML tags), encoding (charset), title, canonical URL and description (from the meta information), plus the current date and time*.

- pass this data to a different application through a function call.
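The two steps above could look roughly like this. It is a deliberately naive sketch: PageData, PageSink, extract_title and strip_tags are hypothetical names, the title search only finds a lowercase title tag without attributes, and the tag stripper does not handle scripts, styles, comments or entities. A production spider would more likely rely on a real HTML parser.

```cpp
#include <chrono>
#include <functional>
#include <string>

// Hypothetical record holding the fields listed above.
struct PageData {
    std::string url;
    std::string canonical_url;
    std::string title;
    std::string description;
    std::string charset;
    std::string text;  // content without HTML tags
    std::chrono::system_clock::time_point fetched_at;
};

// The "different application" is modelled as a callback that receives the data.
using PageSink = std::function<void(const PageData&)>;

// Naive: returns the text between the first "<title>" and "</title>".
std::string extract_title(const std::string& html) {
    const std::string open = "<title>";
    const std::string close = "</title>";
    const auto start = html.find(open);
    if (start == std::string::npos) return "";
    const auto begin = start + open.size();
    const auto end = html.find(close, begin);
    if (end == std::string::npos) return "";
    return html.substr(begin, end - begin);
}

// Naive: drops everything between '<' and '>'.
std::string strip_tags(const std::string& html) {
    std::string text;
    bool in_tag = false;
    for (char c : html) {
        if (c == '<') in_tag = true;
        else if (c == '>') in_tag = false;
        else if (!in_tag) text += c;
    }
    return text;
}

// Builds a PageData from raw HTML and hands it to the sink.
// Charset, canonical URL and description extraction are omitted here.
void process_page(const std::string& url, const std::string& html,
                  const PageSink& sink) {
    PageData page;
    page.url = url;
    page.title = extract_title(html);
    page.text = strip_tags(html);
    page.fetched_at = std::chrono::system_clock::now();
    sink(page);
}
```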

The spider must not get caught in infinite loops. It therefore has to check whether the raw content of a URL is identical to that of a URL with some different parameter. If possible, it should use the canonical tag for this.
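A minimal sketch of such a duplicate check, assuming the canonical URL (if any) and the raw content are already available: a page counts as a duplicate if either its canonical URL or a hash of its raw content has been seen before. DuplicateFilter is a hypothetical name, and std::hash is used only for brevity; it is not collision resistant, so a real implementation might prefer a cryptographic digest.

```cpp
#include <functional>
#include <string>
#include <unordered_set>

// Remembers canonical URLs and content hashes of pages already visited.
class DuplicateFilter {
public:
    // Returns true if the page is new, false if an equivalent page
    // (same canonical URL or same raw content) was recorded before.
    bool mark_seen(const std::string& canonical_url,
                   const std::string& raw_content) {
        const std::string hash_key =
            "hash:" + std::to_string(std::hash<std::string>{}(raw_content));
        bool is_new = seen_.insert(hash_key).second;

        if (!canonical_url.empty()) {
            const bool canonical_new =
                seen_.insert("canonical:" + canonical_url).second;
            is_new = is_new && canonical_new;
        }
        return is_new;
    }

private:
    std::unordered_set<std::string> seen_;
};
```

When mark_seen returns false, the process would skip the page and not follow its links, which is what breaks loops caused by endlessly varying parameters.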

To determine whether a URL has already been spidered, the process can "ask" (via a function call) whether the URL has already been spidered (based on the data extracted with *) and, if so, whether that was more than max_interval days ago. If it has never been spidered, or was last spidered more than max_interval days ago, spider it and get the data; otherwise continue with the next URL.
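The decision described above can be sketched as a small function. The name needs_spidering and the use of std::optional (C++17) for "never spidered" are assumptions for illustration; in the real system the last-spidered time would come from the function call into the other application.

```cpp
#include <chrono>
#include <optional>

using Clock = std::chrono::system_clock;

// Returns true if the URL should be fetched now: either it has never been
// spidered, or the last visit is more than max_interval in the past.
bool needs_spidering(const std::optional<Clock::time_point>& last_spidered,
                     Clock::time_point now,
                     std::chrono::hours max_interval) {
    if (!last_spidered) return true;  // never spidered before
    return now - *last_spidered > max_interval;
}
```

A max_interval of 7 days would be passed as std::chrono::hours(24 * 7).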

Starting points:

- [url removed, login to view]

- [url removed, login to view]

- [url removed, login to view]

Skills: C++ programming, information systems architecture

About the employer:
( 3 reviews ) Eichberg, Switzerland

Project ID: #4979819