The outcome of the development project has to be a WebCrawler for websites (mainly company sites) containing job offers, which extracts their content and stores it in a database.
Extraction of content
The content of the websites containing the job offers (job description) has to be extracted as text and exported to a MySQL database in the format “onlinedate, offlinedate, id, category, url, jobtitle, jobdescription, e-mail (if existing), phone (if existing), contactperson (if existing)”. The software also has to be able to crawl job portals like [login to view URL], [login to view URL], [login to view URL],...
Note that on some company websites a search function has to be used in order to reach the jobs.
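The record format above could be sketched as follows. This is only an illustration under stated assumptions: the table name `jobs`, the helper `store_job` and the two regular expressions are hypothetical, and Python's built-in `sqlite3` stands in for MySQL so the example runs without a server (a MySQL driver would typically use `%s` placeholders instead of `?`).

```python
import re
import sqlite3

# Column names follow the brief; the table name "jobs" is an assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id             INTEGER PRIMARY KEY,
    onlinedate     TEXT NOT NULL,
    offlinedate    TEXT,
    category       TEXT,
    url            TEXT NOT NULL,
    jobtitle       TEXT NOT NULL,
    jobdescription TEXT NOT NULL,
    email          TEXT,
    phone          TEXT,
    contactperson  TEXT
)
"""

# Deliberately simple patterns; real pages will need more tolerant ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s()/-]{6,}\d")


def first_match(pattern, text):
    """Return the first match of pattern in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


def store_job(conn, onlinedate, url, jobtitle, jobdescription,
              category=None, contactperson=None):
    """Insert one job; e-mail and phone are filled only if found in the text."""
    email = first_match(EMAIL_RE, jobdescription)
    phone = first_match(PHONE_RE, jobdescription)
    cur = conn.execute(
        "INSERT INTO jobs (onlinedate, category, url, jobtitle, "
        "jobdescription, email, phone, contactperson) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (onlinedate, category, url, jobtitle, jobdescription,
         email, phone, contactperson),
    )
    conn.commit()
    return cur.lastrowid
```

The optional fields stay NULL when nothing is found, which matches the "if existing" wording of the format.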
Validity of jobs
The software also has to verify whether jobs have been taken offline by a company and are no longer online. Such jobs then have to be marked offline in the database with the date on which they were taken offline (the date of the crawling process in which the software identified that the job is no longer online).
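One possible way to implement this check is to compare the jobs already known as online with the URLs seen during the current crawl. The sketch below assumes a hypothetical in-memory mapping of URL to offline date; in the real software this would be a query and an update against the database.

```python
from datetime import date


def mark_offline(known_jobs, seen_urls, crawl_date):
    """Mark jobs that were not seen in this crawl as offline.

    known_jobs: dict mapping job URL -> offlinedate (None while online).
    seen_urls:  set of job URLs found during the current crawl.
    crawl_date: datetime.date of the current crawling process.
    Returns the list of URLs newly marked offline.
    """
    newly_offline = []
    for url, offlinedate in known_jobs.items():
        if offlinedate is None and url not in seen_urls:
            # The offline date is the date of the crawl that noticed it.
            known_jobs[url] = crawl_date.isoformat()
            newly_offline.append(url)
    return newly_offline
```

A page that could not be fetched because of a timeout or server error should probably not be treated as "not seen"; otherwise a temporary outage would mark live jobs as offline.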
Definition / Configuration of URLs
The configuration of the software must allow the URLs of the websites to be crawled, including sub-URLs, to be defined. The URLs are mainly company webpages, and the software has to identify on which pages job offers are published.
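A URL configuration of this kind might look like the sketch below. The structure of `CRAWL_TARGETS`, the example domain and the function `in_scope` are assumptions made for illustration, not a prescribed format.

```python
from urllib.parse import urlparse

# Hypothetical configuration: one entry per website, with the sub-URLs
# (path prefixes) below which the crawler is allowed to follow links.
CRAWL_TARGETS = [
    {"start_url": "https://www.example.com/", "sub_urls": ["/careers", "/jobs"]},
]


def in_scope(url, targets=CRAWL_TARGETS):
    """Return True if url is on a configured host and below a configured sub-URL."""
    parsed = urlparse(url)
    for target in targets:
        if parsed.netloc != urlparse(target["start_url"]).netloc:
            continue
        for prefix in target["sub_urls"]:
            base = prefix.rstrip("/")
            # Match the prefix itself or anything below it, but not
            # unrelated paths that merely start with the same letters.
            if parsed.path == base or parsed.path.startswith(base + "/"):
                return True
    return False
```

Restricting the crawl to configured prefixes keeps it away from unrelated parts of a company site; identifying which of the in-scope pages actually contain job offers is a separate step.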
Please do not bid on the project if you are not familiar with the technology of web crawlers.
Configuration of Keywords
Besides the URLs, keywords can also be defined; a website will only be extracted if predefined keywords appear on it. Keywords can be grouped into categories so that a job which has been found can be assigned to a certain job category (e.g. consulting, banking, …).
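The keyword grouping could be sketched as follows. The keyword lists in `KEYWORD_GROUPS` and the function `categorize` are illustrative assumptions; only the category names "consulting" and "banking" come from the description above.

```python
import re

# Hypothetical keyword groups: category name -> keywords that indicate it.
KEYWORD_GROUPS = {
    "consulting": ["consultant", "consulting"],
    "banking": ["bank", "banking", "credit analyst"],
}


def categorize(text, groups=KEYWORD_GROUPS):
    """Return the categories whose keywords occur as whole words in text.

    An empty list means no keyword matched, so the page would be skipped.
    """
    lowered = text.lower()
    matches = []
    for category, keywords in groups.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
            if re.search(pattern, lowered):
                matches.append(category)
                break  # one hit per category is enough
    return matches
```

Matching on word boundaries avoids false hits such as "bank" inside an unrelated longer word.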
Progress and statistics
A small progress and statistics module always has to show the progress of a crawling process as well as the result (websites crawled, new jobs found, jobs updated, jobs taken offline, websites on which a job was found). The data has to be provided for each new crawling process.
In order to identify problems with a crawling process, an error log has to be written that records the problems which occurred during that process.
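The counters and the error log could be kept together per crawling process, for example as in the sketch below. The class `CrawlStats`, the logger name `crawler.errors` and both helper functions are hypothetical; only the standard `logging` module is used.

```python
import logging
from dataclasses import dataclass, field


@dataclass
class CrawlStats:
    """Counters for one crawling process (field names are assumptions)."""
    websites_crawled: int = 0
    new_jobs: int = 0
    jobs_updated: int = 0
    jobs_offline: int = 0
    errors: int = 0
    sites_with_jobs: set = field(default_factory=set)

    def summary(self):
        return (f"crawled={self.websites_crawled} new={self.new_jobs} "
                f"updated={self.jobs_updated} offline={self.jobs_offline} "
                f"sites_with_jobs={len(self.sites_with_jobs)} "
                f"errors={self.errors}")


def make_error_logger(handler, name="crawler.errors"):
    """Attach handler (e.g. logging.FileHandler("crawl_errors.log")) to a logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.ERROR)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep crawl errors out of the root logger
    return logger


def log_fetch_error(logger, stats, url, exc):
    """Count the error and write the URL and exception to the error log."""
    stats.errors += 1
    logger.error("fetch failed url=%s error=%r", url, exc)
```

Creating a fresh `CrawlStats` object at the start of each crawl gives the per-process figures asked for above, and `summary()` can be printed periodically as a simple progress display.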
The software has to identify if job offers are posted on different sites, for example on a company home page as well as on a job portal. This information has to be stored in the database.
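One simple approach to finding the same offer on several sites is to compare a fingerprint of the normalized title and description. The sketch below is an assumption about how this might be done, not a required method; the functions `fingerprint` and `group_duplicates` are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(jobtitle, jobdescription):
    """Hash of the lower-cased text with punctuation and whitespace collapsed."""
    combined = f"{jobtitle} {jobdescription}".lower()
    normalized = re.sub(r"\W+", " ", combined).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(jobs):
    """Group job URLs that share a fingerprint.

    jobs: iterable of (url, jobtitle, jobdescription) tuples.
    Returns a dict fingerprint -> list of URLs, only for offers
    that were found on more than one URL.
    """
    by_fingerprint = defaultdict(list)
    for url, title, description in jobs:
        by_fingerprint[fingerprint(title, description)].append(url)
    return {fp: urls for fp, urls in by_fingerprint.items() if len(urls) > 1}
```

This only catches offers whose text is identical after normalization. Job portals often shorten or reformat descriptions, so a similarity-based comparison may be needed in practice.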
The WebCrawler must ensure a fast crawling process.
The software has to be documented properly (functions, methods, etc.) so that a third party is able to understand it.
The payment amount will be put in an escrow agreement. After the software has been finished, delivered and quality assured the escrow payment will be released.
We possess extensive experience of developing numerous high-end websites and are highly organized and adept at meeting tight deadlines that are so common in this industry. Please see PMB for more details.
Hi, I have experience working on WebCrawler bots. I have been working on a WebCrawler bot which downloads data from hundreds of sites. Please see my PM. Thanks, Thuc