language web crawler

We want to crawl the web to get: 1) lists of the words used in different languages on the web, and 2) a count of the number of times each word is found in each language UNTIL WE HAVE A STATISTICALLY SIGNIFICANT SAMPLE. Maybe 1000 pages of each language? We do not have a list of URLs we want to use. All that matters is that we do not count the same page twice. Other than that, ANY 1000 pages of each language will be fine.

I imagine that the program will crawl pages by charset, CHECK to be sure the page is the "correct" language (per the charset tag) by looking for the simplest words in that language (see CHECK below), count the words on the page, note which page it is so it does not get counted again, and move on.

CHECK: Because charset tags are not always reliable, we would pick 20 (or so) very common words that are unique to each language. An English example: the, an, in, are, is, and, to, on, this, a, by, that, were, have, been, will, of ... and then look for a meaningful subset of them to appear on a page before deciding what language it is. Obviously, we would test the search mechanism "by hand" first to be sure it worked in each language. Note: I will identify the "check" words for each language, and accordingly be responsible for the quality of this language filter.

The app will place the words and counts into an Excel spreadsheet (one sheet per language). As an example, after using this tool in English (and sorting by frequency within Excel) there would be a VERY long list of words, each with a number next to it (indicating how many times it was found), like:

the 9,323,343
of 9,028,282
and 9,003,939
a 8,757,232
etc.

The languages of interest are: Afrikaans, Arabic, Bulgarian, Catalan, Pinyin (Chinese), Croatian, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malay, Norwegian, Polish, Portuguese, Romanian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese.
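To make the check-and-count step concrete, here is a minimal sketch in Python of how one language's tally might work. It is only an illustration under stated assumptions, not a required design: the names `LanguageTally`, `looks_like` and `tokenize` are hypothetical, the threshold of 8 matching check words is a guess at what "a meaningful subset" could mean, and pages are treated as duplicates only when their URLs match exactly. Fetching pages and writing the Excel sheets are left out. Splitting on runs of letters will also not segment languages written without spaces between words (for example Thai or Japanese), which would need separate handling.

```python
import re
from collections import Counter

# Check words for English, taken from the example list in the posting.
ENGLISH_CHECK_WORDS = {
    "the", "an", "in", "are", "is", "and", "to", "on", "this",
    "a", "by", "that", "were", "have", "been", "will", "of",
}

# Assumption: how many distinct check words count as "a meaningful subset".
MIN_CHECK_HITS = 8

# Runs of Unicode letters (no digits, no underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split page text into lowercase words."""
    return [w.lower() for w in WORD_RE.findall(text)]


def looks_like(words, check_words, min_hits=MIN_CHECK_HITS):
    """Return True if enough distinct check words appear on the page."""
    return len(check_words & set(words)) >= min_hits


class LanguageTally:
    """Word counts for one language, never counting the same URL twice."""

    def __init__(self, check_words, page_limit=1000):
        self.check_words = check_words
        self.page_limit = page_limit
        self.seen_urls = set()
        self.counts = Counter()
        self.pages_counted = 0

    def add_page(self, url, text):
        """Count the page's words if it is new and passes the language check."""
        if url in self.seen_urls or self.pages_counted >= self.page_limit:
            return False
        self.seen_urls.add(url)  # remember the page even if it fails the check
        words = tokenize(text)
        if not looks_like(words, self.check_words):
            return False
        self.counts.update(words)
        self.pages_counted += 1
        return True
```

With a tally built as `LanguageTally(ENGLISH_CHECK_WORDS)`, calling `tally.counts.most_common()` returns the words sorted by frequency, which is the order the spreadsheet example above shows.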

## Deliverables

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.

2) Installation package that will install the software (in ready-to-run condition) on the platform(s) specified in this bid request.

3) Exclusive and complete copyrights to all work purchased. (No GPL, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site).

## Platform

We are running Windows 2000, IE 6, and Excel 2002.

Skills: Database Administration, Engineering, MySQL, PHP, Information Systems Architecture, Software Testing, SQL, Web Hosting, Website Management, Website Testing

About the employer:
( 9 reviews ) United States

Project ID: #3000944
