We want to crawl the web to get: 1) lists of the words used in different languages on the web, and 2) a count of the number of times each word is found in each language UNTIL WE HAVE A STATISTICALLY SIGNIFICANT SAMPLE. Maybe 1000 pages of each language?

We do not have a list of URLs we want to use. All that matters is that we do not count the same page twice. Other than that, ANY 1000 pages of each language will be fine.

I imagine that the program will crawl pages by charset, CHECK to be sure the page is the "correct" language (per the charset tag) by comparing the simplest words in that language (see CHECK below), count the words on the page, note which page it is so it does not get counted again, and move on.

CHECK: Because charset tags are not always reliable, we would pick 20 (or so) really common words that are unique to each language. An English example: the, an, in, are, is, and, to, on, this, a, by, that, were, have, been, will, of ... and then look for a meaningful subset of them to appear on a page before deciding what language it is. Obviously, we would test the search mechanism "by hand" first to be sure it worked in each language.

Note: I will identify the "check" words for each language, and will accordingly be responsible for the quality of this language filter.

The app will place the words and counts into an Excel spreadsheet (one sheet per language). As an example, after using this tool in English (and sorting by frequency within Excel) there would be a VERY long list of words, each with a number next to it (indicating how many times it was found), like:

the    9,323,343
of     9,028,282
and    9,003,939
a      8,757,232
etc.

The languages of interest are: Afrikaans, Arabic,
Bulgarian, Catalan, Pinyin (Chinese), Croatian, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malay, Norwegian, Polish, Portuguese, Romanian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese.
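To make the intended flow concrete, here is a rough Python sketch of the CHECK, count, and "do not count the same page twice" steps for a single language. It is only an illustration under stated assumptions, not a required design: the names LanguageTally, looks_like, and tokenize are made up for this example, the threshold MIN_HITS = 8 is a guess at what a "meaningful subset" means, and the simple letter-based word splitting will not work as-is for languages written without spaces (e.g. Japanese, Thai) or with combining marks (e.g. Hindi). Fetching pages and writing the Excel sheets are left out.

```python
import re
from collections import Counter

# Check words taken from the English example above; other languages
# would get their own sets, supplied by the buyer.
CHECK_WORDS = {
    "english": {"the", "an", "in", "are", "is", "and", "to", "on", "this",
                "a", "by", "that", "were", "have", "been", "will", "of"},
}

# Assumption: how many distinct check words must appear on a page
# before it is accepted as being in the language.
MIN_HITS = 8

# Runs of letters only (no digits or underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split page text into lowercase words."""
    return [w.lower() for w in WORD_RE.findall(text)]


def looks_like(language, words):
    """CHECK step: are enough of the language's check words present?"""
    hits = CHECK_WORDS[language] & set(words)
    return len(hits) >= MIN_HITS


class LanguageTally:
    """Word counts for one language, never counting a URL twice."""

    def __init__(self, language, page_limit=1000):
        self.language = language
        self.page_limit = page_limit
        self.seen_urls = set()   # pages already counted
        self.counts = Counter()  # word -> number of occurrences

    def done(self):
        return len(self.seen_urls) >= self.page_limit

    def add_page(self, url, text):
        """Count the page if it is new and passes CHECK; report success."""
        if self.done() or url in self.seen_urls:
            return False
        words = tokenize(text)
        if not looks_like(self.language, words):
            return False
        self.seen_urls.add(url)
        self.counts.update(words)
        return True
```

A crawler would call add_page(url, text) for each page it fetches and stop once done() returns True; counts.most_common() then gives the words sorted by frequency, ready to be written out one sheet per language.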
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Installation package that will install the software (in ready-to-run condition) on the platform(s) specified in this bid request.
3) Exclusive and complete copyrights to all work purchased. (No GPL, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site).
We are running Windows 2000, IE 6, and Excel 2002.