Cancelled

web data crawler

1) The Robots receive a list of URLs, called IndexURLs below. This list of IndexURLs is modified regularly, so the Robots must re-evaluate it whenever it changes.

2) Based on each IndexURL, a new set of URLs is generated, called DetailURLs below. A DetailURL is the page from which the relevant information for a new Event, Location or Performer is extracted.

1) The pattern for locating the DetailURLs must be configurable, e.g. in the list of IndexURLs (see the sketch after this list).

2) The list (it could be a queue) of DetailURLs to be processed must be updated daily.

3) Already processed DetailURLs must be skipped and not added to the list again.
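As a minimal sketch of this discovery step, assuming each IndexURL entry carries a regular expression for its DetailURLs (the class and field names below are hypothetical, not part of the spec):

    import java.util.*;
    import java.util.regex.*;

    // Sketch: find DetailURLs on an index page via a per-IndexURL pattern,
    // skipping URLs that have already been processed (requirement above).
    public class DetailUrlExtractor {
        // In practice this set would be persisted in BDB or an RDBMS.
        private final Set<String> processed = new HashSet<>();

        public List<String> extract(String indexPageHtml, Pattern detailUrlPattern) {
            List<String> fresh = new ArrayList<>();
            Matcher m = detailUrlPattern.matcher(indexPageHtml);
            while (m.find()) {
                String url = m.group();
                if (processed.add(url)) {  // add() returns false for duplicates
                    fresh.add(url);
                }
            }
            return fresh;
        }
    }

A configured pattern for one IndexURL might then look like Pattern.compile("https://example\\.org/events/\\d+"); the actual configuration format is left open by the spec.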

3) Processing of each DetailURL

1) From a new DetailURL, the contents are extracted:

1) The Robots must be able to determine the language and the type of item (Event, Location, Performer) in a copyright-safe way (copyrighted content may be added but must be flagged), based on clear criteria that can easily be handled automatically.

2) The API should be used to check whether the item is already present in the system (this case needs to be specified in more detail!).

3) The API should be used to add the new item.

4) Add the processed DetailURL to the Scheduler so that the event itself, its location and its performer are re-checked one week before, three days before, and on the day of the event.

5) Save a hash of the contents for later comparison (see the hashing sketch after this list).

6) Templates/patterns for Locations, Events and Performers must be configurable for each IndexURL.
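A sketch of steps 4 and 5 above, assuming the re-check dates are derived from the event date and that SHA-256 is an acceptable hash (both are assumptions; the spec names neither):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.time.LocalDate;
    import java.util.List;

    public class NewDetailUrlSteps {

        // Step 5: hash of the extracted contents, stored for later comparison.
        // SHA-256 is an assumption; the spec does not name an algorithm.
        static String contentHash(String contents) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(contents.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        // Step 4: re-check one week before, three days before, and on the event day.
        static List<LocalDate> recheckDates(LocalDate eventDate) {
            return List.of(eventDate.minusWeeks(1), eventDate.minusDays(3), eventDate);
        }
    }

Storing the hex digest alongside the DetailURL is enough for the later comparison step.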

2) From an old (i.e. scheduled) DetailURL, the contents are extracted:

1) Compare them against the previously stored hash value.

2) If the hash differs, update the item; otherwise the task is finished (sketched below).
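The re-check then reduces to a hash comparison. This sketch reuses the contentHash helper from above; fetchContents, updateItem and storeHash are hypothetical stand-ins for the extraction step, the API call and the hash store:

    // Reuses contentHash from the sketch above; the abstract methods are
    // hypothetical stand-ins for extraction, the update API, and the hash store.
    public abstract class RecheckTask {

        abstract String fetchContents(String detailUrl) throws Exception;
        abstract void updateItem(String detailUrl, String contents) throws Exception;
        abstract void storeHash(String detailUrl, String hash);

        // Scheduled re-check: only call the update API when the contents changed.
        void recheck(String detailUrl, String storedHash) throws Exception {
            String contents = fetchContents(detailUrl);
            String newHash = NewDetailUrlSteps.contentHash(contents);
            if (!newHash.equals(storedHash)) {
                updateItem(detailUrl, contents);
                storeHash(detailUrl, newHash);
            } // else: unchanged, the task is finished
        }
    }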

4) Logging/Statistics

1) The current status must be visualized (which IndexURL/DetailURL is being processed, errors that occurred during extraction or API invocation, etc.).

2) API calls must be logged and visualized

3) Processing of the IndexURLs and DetailURLs must be logged and visualized

4) The lists of IndexURLs and DetailURLs must be visualized and searchable

5) Editors must be able to change contents and states in error cases and to manually restart the remaining process on the modified contents.

5) Maintenance

1) Once a month, clean events that finished more than 30 days ago out of the Robots' storage (the stored hash codes, but not the log entries); see the SQL sketch after this list.

2) start/stop scripts
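Assuming the hashes end up in an RDBMS table (table and column names here are hypothetical), the monthly cleanup could be a single statement; note that date arithmetic syntax differs between databases:

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class MonthlyCleanup {

        // Deletes stored hashes for events that finished more than 30 days ago.
        // Log entries live elsewhere and are deliberately not touched.
        // Table/column names are hypothetical; date arithmetic varies by database
        // (this form works in e.g. PostgreSQL).
        static int purgeOldHashes(Connection db) throws Exception {
            String sql = "DELETE FROM content_hashes "
                       + "WHERE event_end_date < CURRENT_DATE - 30";
            try (PreparedStatement ps = db.prepareStatement(sql)) {
                return ps.executeUpdate();
            }
        }
    }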

6) Scalability

1) The Robots must support clustering.

7) Transactional Robots

1) When Robots are killed or otherwise stopped (e.g. by a power outage), tasks that were not yet completed must be redone (sketched below).
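One way to meet this, sketched under the assumption that task state is persisted in a database table (schema and names hypothetical):

    import java.sql.Connection;
    import java.sql.Statement;

    public class CrashRecovery {

        // A task moves PENDING -> IN_PROGRESS -> DONE, each transition committed
        // in its own transaction. On Robot startup, anything still IN_PROGRESS
        // belongs to a run that died (crash, power outage) and is requeued.
        static int requeueUnfinishedTasks(Connection db) throws Exception {
            try (Statement st = db.createStatement()) {
                return st.executeUpdate(
                    "UPDATE tasks SET state = 'PENDING' WHERE state = 'IN_PROGRESS'");
            }
        }
    }

Committing each transition separately is what makes restart recovery safe: a task only becomes DONE after all of its API calls have succeeded.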

8) Uses Java and BDB or another RDBMS.

Skills: Data Processing, Engineering, Java, XML

About the employer:
(0 reviews) Köln, Germany

Project ID: #405320

9 freelancers are bidding on average $494 for this job

SigmaVisual

We can help with your project; please check PMB to see our related experience.

$600 USD in 7 days
(47 reviews)
6.5
interpb

Hi friend, please view PMB.

$250 USD in 10 days
(55 reviews)
6.1
tamrakar81

Hi, please refer to PMB, thanks.

$750 USD in 15 days
(36 reviews)
5.0
lokeshverma2

Hi Sir, please respond in PMB.

$700 USD in 7 days
(2 reviews)
2.5
Mahairod

Interesting project

$550 USD in 60 days
(0 reviews)
0.0
merlin41

Hello, an already existing framework solution may be available. Regards, A

$250 USD in 1 day
(0 reviews)
0.0
javva

Hello, I have the experience to handle your needs. This could be done in 2 ways: 1) a J2EE application server (pro: web interface to track status, automatic clustering) 2) console Java applications (robots) + swi…

$500 USD in 10 days
(0 reviews)
0.0
adamxrc

Hi, check PMB for details, thanks.

$500 USD in 15 days
(0 reviews)
0.0
taggsoft

You can check the site [url removed, login to view]; this site is owned and developed by TaggSoft solutions (i.e. us) and it has its own web crawler.

$350 USD in 10 days
(0 reviews)
0.0