Cancelled

Crawling and extraction program

This is part of the description.

The program starts from the output of an extraction program, like this:

"Company","Address","Telephone","Mobile","Website","Email"

"Apotheek Centrum Schelle","Provinciale Steenweg 95 2627 Schelle","03 887 54 72","","",""

"De Lindeboom","Nationalestraat 119 2000 Antwerpen","","","",""

"Morel E","Brusselsesteenweg 298 2800 Mechelen","015 41 55 65","","",""

"Van De Mierop-Mestdagh BVBA","Lindenlaan 66 2340 Beerse","014 61 13 64","","",""

"Hooijmaaijer J","Clemenceaustraat 43 2860 Sint-Katelijne-Waver","015 21 22 93","","",""

"Horsten L NV","Vrijheid 98 2320 Hoogstraten","03 314 57 24","","",""

"Vandeweyer R","Oranjestraat 94 2060 Antwerpen","03 233 82 75","","",""

"Danckaert J","Ter Heydelaan 173-175 2100 Deurne (Antwerpen)","03 324 95 30","","",""

"Vermylen K BVBA","Leo Kempenaersstraat 7 2223 Schriek (Heist-Op-Den-Berg)","015 23 33 70","","",""

"Onze Apotheek cv","Antwerpsesteenweg 146 Bus 1 2500 Lier","","","",""

"Peleman","Schipstraat 1 2870 Puurs","03 889 23 63","","",""

"Ter Borcht BVBA","Bernard van Orleyplein 5 2650 Edegem","03 440 64 91","","",""

"De Lindeboom-Apotheek","Nationalestraat 119 2000 Antwerpen","","","",""

The program must work in different steps.

The first step is checking whether the business has a site. (If the input file already has a URL for the listing, the program can go immediately to step 2.)
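The routing described above could be sketched roughly like this in Python. The function names `read_listings` and `first_step_for` are made up for illustration, and the sketch assumes the input keeps the six-column header shown in the sample (shorter rows simply leave the last columns empty).

```python
import csv
import io

def read_listings(text):
    """Parse the extractor output into one dict per listing."""
    # The sample output has a blank line between records, so drop those first.
    lines = [line for line in text.splitlines() if line.strip()]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    return [dict(row) for row in reader]

def first_step_for(listing):
    """Return 2 when the listing already has a URL, otherwise 1."""
    website = (listing.get("Website") or "").strip()
    return 2 if website else 1

sample = '''"Company","Address","Telephone","Mobile","Website","Email"

"Peleman","Schipstraat 1 2870 Puurs","03 889 23 63","","",""

"Luigi Lloyd Loom","Puursesteenweg 392B 2880 Bornem","03 899 26 35","","http://www.luigi.be"
'''

for listing in read_listings(sample):
    print(listing["Company"], "-> step", first_step_for(listing))
```

A row with only five values, like the Luigi example, still works here because `csv.DictReader` fills the missing last column with `None`.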

Example: "Pica Pica","Hofkwartier 20 2200 Herentals","014 22 02 55","",""

Try the following URLs: [url removed, login to view] [url removed, login to view]. Both have a site, so each must be crawled to look for the address. In this example [url removed, login to view] is the right one. The program can use Pica Pica as the business name in the output and must add the URL to the output.

If the business name contains "bvba", "nv", a single letter, or "'t", that part may not be used in the URL.

Example: for Pica Pica NV, only try [url removed, login to view] or [url removed, login to view], not [url removed, login to view].
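A possible way to build the URLs to try is sketched below. The real URLs are not visible in this description, so the pattern used here (words joined together, and words joined with a hyphen, behind `http://www.`) is only an assumption, and `url_words` and `candidate_urls` are hypothetical names.

```python
import re

# Parts of a business name that may not end up in the URL.
SKIP_WORDS = {"bvba", "nv", "'t", "\u2019t"}

def url_words(business_name):
    """Lower-case the name and drop bvba, nv, 't and single letters."""
    words = []
    for word in business_name.lower().split():
        if word in SKIP_WORDS:
            continue
        # Keep only characters that can appear in a host name.
        word = re.sub(r"[^a-z0-9-]", "", word)
        if len(word) <= 1:
            continue
        words.append(word)
    return words

def candidate_urls(business_name, tlds=(".be",)):
    """Assumed pattern: joined and hyphenated variants for each ending."""
    words = url_words(business_name)
    if not words:
        return []
    hosts = []
    for host in ("".join(words), "-".join(words)):
        if host not in hosts:
            hosts.append(host)
    return [f"http://www.{host}{tld}" for tld in tlds for host in hosts]

print(candidate_urls("Pica Pica NV"))
print(candidate_urls("Morel E", tlds=(".be", ".com")))
```

Passing `tlds=(".be", ".com")` lists all .be candidates first and the .com ones after them, so the .com variants are only reached when no .be site turns up.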

Another example: "Sleepwise","Turnhoutsebaan 328B 2970 Schilde","03 385 31 21","",""

The URL [url removed, login to view] is a site, so crawl it for the address.

This is the address on the site:

Turnhoutsebaan 225 - B-2970 Schilde

Only the house number (225) is different. The program must still accept this site, because only one thing is different, and must use the address from the site in the output.
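One way to read the "only one thing is different" rule is to split both addresses into parts and count how many parts differ. The sketch below does that. Treating "B-2970" the same as "2970" and the helper names are assumptions made for illustration, not part of the description.

```python
import re

def address_parts(address):
    """Split an address into comparable lower-case parts."""
    text = address.lower()
    text = re.sub(r"\([^)]*\)", " ", text)        # drop "(Antwerpen)"-style notes
    text = re.sub(r"\bb-(?=\d{4}\b)", "", text)   # "B-2970" -> "2970"
    return re.findall(r"\w+", text)

def differing_parts(listing_address, site_address):
    """Count parts that differ, or None if the addresses have different shapes."""
    a = address_parts(listing_address)
    b = address_parts(site_address)
    if len(a) != len(b):
        return None
    return sum(1 for x, y in zip(a, b) if x != y)

def choose_address(listing_address, site_address, max_diff=1):
    """Return the site's address if at most max_diff parts differ, else None."""
    diff = differing_parts(listing_address, site_address)
    if diff is not None and diff <= max_diff:
        return site_address
    return None

print(choose_address("Turnhoutsebaan 328B 2970 Schilde",
                     "Turnhoutsebaan 225 - B-2970 Schilde"))
```

With the Sleepwise addresses only the house number differs, so the site's address is returned. Two unrelated addresses return `None`, which would mean the site is not accepted.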

Also, if there is no .be site, try .com, as in this example:

"Poppels Meubelhuis","Zandkuilstraat 23 2382 Poppel (Ravels)","","",""

There is no .be site, but [url removed, login to view] is a site, and the right address is on that site:

Poppels meubelhuis, Tilburgseweg 64 (Slaapwinkel),

Zandkuilstraat 23 (Woonwinkel), B-2382 Poppel, België

tel.: +32 (0)14 65 78 54, fax: +32 (0)14 65 94 69

e-mail:

Here the e-mail address and fax number can also be added to the output. The text between ( ) is not important.
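Pulling the telephone, fax, and e-mail out of text like the block above could look roughly like this. The regular expressions are a guess at what Belgian contact lines look like and will not cover every site. `extract_contacts` and `strip_notes` are hypothetical helper names.

```python
import re

# A loose phone pattern: optional "+", then digits mixed with spaces, (), ., /, -
PHONE = r"\+?\d[\d ()./-]{6,}\d"
TEL_RE = re.compile(r"\btel\.?\s*:?\s*(" + PHONE + ")", re.IGNORECASE)
FAX_RE = re.compile(r"\bfax\.?\s*:?\s*(" + PHONE + ")", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def first_match(pattern, text):
    """Return the captured group (or whole match) of the first hit, else None."""
    match = pattern.search(text)
    if not match:
        return None
    return match.group(1) if pattern.groups else match.group(0)

def extract_contacts(page_text):
    return {
        "telephone": first_match(TEL_RE, page_text),
        "fax": first_match(FAX_RE, page_text),
        "email": first_match(EMAIL_RE, page_text),
    }

def strip_notes(address_line):
    """Remove notes in parentheses such as "(Woonwinkel)" from an address line."""
    return re.sub(r"\s*\([^)]*\)", "", address_line)

page_text = """Poppels meubelhuis, Tilburgseweg 64 (Slaapwinkel),
Zandkuilstraat 23 (Woonwinkel), B-2382 Poppel, België
tel.: +32 (0)14 65 78 54, fax: +32 (0)14 65 94 69
e-mail:"""

print(extract_contacts(page_text))
print(strip_notes("Zandkuilstraat 23 (Woonwinkel), B-2382 Poppel, België"))
```

The parentheses are only stripped from address lines. Doing the same to the phone numbers would also remove the "(0)" in "+32 (0)14".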

Example: "C-Meubel","Antwerpsesteenweg 19 2840 Rumst","015 31 77 16","",""

There is no [url removed, login to view], so also try [url removed, login to view], which does exist. The address is on that site, so it is a good one. The site also has an e-mail address that can be added to the output.

Another example: "VI-Spring","Dorp 78 2230 Herselt","014 54 55 11","",""

It has a site, [url removed, login to view], but the address is not on it, so the site cannot be used and this business name must be checked in the next step.

Another example: business name Hof van Aragon NV (anything between ( ) in a business name must not be used).

[url removed, login to view] is a site, but it immediately redirects you to [url removed, login to view]. The program must then look for the address on [url removed, login to view].
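A minimal sketch of the redirect case, assuming plain HTTP redirects and Python's standard `urllib`: `urlopen` follows such redirects by itself, and `geturl()` on the response gives the URL that was finally reached. The function name and the `opener` parameter (there so the function can be tried without a network) are illustrative choices. Redirects done with JavaScript or a meta refresh tag are not followed by this approach.

```python
import urllib.error
import urllib.request

def fetch_following_redirects(url, timeout=10, opener=urllib.request.urlopen):
    """Return (final_url, html), or None when there is no reachable site."""
    try:
        with opener(url, timeout=timeout) as response:
            final_url = response.geturl()   # differs from url after a redirect
            html = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError, ValueError):
        # Unknown host, HTTP error, timeout, or a malformed URL.
        return None
    return final_url, html
```

The caller can compare the returned `final_url` with the URL it asked for. If they differ, the page that was crawled for the address is the one behind the redirect, and that final URL is probably the one to write to the output.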

Another example: business name Zuid-West

[url removed, login to view] is a site and has the right address.

Another example: "Odrada Interieur NV","Molsesteenweg 46 2490 Balen","014 34 66 00","",""

There is no site for [url removed, login to view], [url removed, login to view], [url removed, login to view], [url removed, login to view],

but [url removed, login to view] is a site, and that site has the right address.

Another example: the business name is "de lindeboom". The URLs that need to be tried are [url removed, login to view] [url removed, login to view] [url removed, login to view] [url removed, login to view] [url removed, login to view] [url removed, login to view]

If there are sites, each site must be crawled for the address (if the URL is redirected, the site it redirects to must be crawled).

In this case [url removed, login to view] is the right site.

If there is an e-mail address, also add it to the output.
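Put together, the first step is a loop over the URLs to try that stops at the first site showing the listing's address. The sketch below assumes a `fetch` function supplied by the caller that returns `(final_url, page_text)` for a live site and `None` otherwise. That function, and the simple "every word of the address appears on the page" test, are assumptions for illustration only.

```python
import re

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def page_has_address(page_text, address):
    """True when every word of the listing address appears on the page."""
    address = re.sub(r"\([^)]*\)", " ", address)   # ignore "(Ravels)"-style notes
    wanted = words(address)
    return bool(wanted) and wanted <= words(page_text)

def first_matching_site(candidate_urls, address, fetch):
    """Return the final URL of the first site that shows the address, else None."""
    for url in candidate_urls:
        result = fetch(url)
        if result is None:
            continue                      # no site at this URL, try the next one
        final_url, page_text = result     # final_url already reflects redirects
        if page_has_address(page_text, address):
            return final_url
    return None                           # hand the listing to the next step

# Stand-in for real HTTP requests: a dict of made-up pages.
pages = {
    "http://www.b.example": ("http://www.b.example/", "Welcome"),
    "http://www.c.example": ("http://www.c-new.example/",
                             "Contact: Molsesteenweg 46, B-2490 Balen"),
}
urls = ["http://www.a.example", "http://www.b.example", "http://www.c.example"]
print(first_matching_site(urls, "Molsesteenweg 46 2490 Balen", pages.get))
```

A site that exists but does not show the address, like the VI-Spring case, is skipped in the same way as a URL with no site at all.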

----------------------------------------------------------------------------------------------------

Step 2 (if the listing already has a site)

If the listing has a site, the program must check whether the site is still online.

Then the program must check whether the business name is in the title.

Example: "Luigi Lloyd Loom","Puursesteenweg 392B 2880 Bornem","03 899 26 35","","http://www.luigi.be"

The site [url removed, login to view] is still online. The title is

<title>:: Luigi - Original Lloyd loom - Exclusive Rattan furniture - Outdoor furniture - Bedrooms ::</title>

Luigi Lloyd Loom is in the title, so the business name can stay the same.

(If only "Luigi" were in the title, the business name would have to be changed to Luigi.)

If there is nothing in the title, crawl the site looking for the address and the business name. If the program does not find an address and a part of the business name, it must go to step 3. If it finds the address and part of the business name (for example only "Luigi"), it must use that part of the business name (Luigi and not Luigi Lloyd Loom) and the address in the output.
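The title check could be sketched as follows: take the text of the `<title>` tag, and keep only those words of the business name that occur in it. Matching word by word, ignoring case, is an interpretation of the Luigi example. `page_title` and `name_from_title` are hypothetical names.

```python
import re

def page_title(html):
    """Return the text inside <title>...</title>, or an empty string."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

def name_from_title(business_name, html):
    """Keep the words of the business name that appear in the page title."""
    title_words = set(re.findall(r"\w+", page_title(html).lower()))
    kept = []
    for word in business_name.split():
        parts = re.findall(r"\w+", word.lower())   # "Zuid-West" -> zuid, west
        if parts and all(part in title_words for part in parts):
            kept.append(word)
    return " ".join(kept) or None

html = ("<title>:: Luigi - Original Lloyd loom - Exclusive Rattan furniture"
        " - Outdoor furniture - Bedrooms ::</title>")
print(name_from_title("Luigi Lloyd Loom", html))
print(name_from_title("Luigi Lloyd Loom", "<title>Luigi</title>"))
```

A return value of `None` means no part of the name was found in the title, in which case the site itself would have to be crawled for the address and the business name as described above.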

Skills: data entry, data processing, Javascript


About the employer:
( 5 reviews ) Hulshout, Belgium

Project ID: #425800