Words used in Wikipedia

Using the latest full English Wikipedia database, write a program that generates a frequency-ranked, case-sensitive list of the words used in the main entry pages. The list should include single words and groups of up to four words (hyphen- or space-separated), drawn only from text (not wiki tags) and only from the middle of sentences (never the first word of a sentence, so that every entry is correctly capitalized).
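
To make the counting step concrete, here is a minimal Python sketch. It is only an illustration under several assumptions: sentences are split naively on ., ! or ? followed by whitespace, a hyphenated compound is kept as a single token, and the input is plain text with wiki markup already stripped. The names count_ngrams, SENTENCE_END and TOKEN are made up for this example.

```python
import re
from collections import Counter

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Assumption: a token is a run of letters/digits, optionally joined
# by hyphens or apostrophes (so "well-known" is one token).
TOKEN = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def count_ngrams(text, max_words=4):
    """Count case-sensitive groups of 1..max_words mid-sentence words."""
    counts = Counter()
    for sentence in SENTENCE_END.split(text):
        # Drop the sentence-initial word, whose capital letter is ambiguous.
        tokens = TOKEN.findall(sentence)[1:]
        for n in range(1, max_words + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    sample = "The cat sat on the mat. The cat sat."
    for group, freq in count_ngrams(sample).most_common(5):
        print(freq, group)
```

Groups never cross a sentence boundary in this sketch, and a real run over the whole database would need a more careful sentence splitter (abbreviations such as "e.g." would break this one) and probably on-disk counting rather than a single in-memory Counter.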

Provide a list of all words and word groups that appear at least 10 times in Wikipedia, along with a file containing, for each word, ten complete sentences in which it appears and the name of the wiki page on which each sentence appears, e.g.


[page: Prion]

Prions are hypothesized to infect and propagate by refolding abnormally into a structure which is able to convert normal molecules of the protein into the abnormally structured form.

[page: Mars_Ocean_Hypothesis]

The blue region of low topography in the Martian northern hemisphere is hypothesized to be the site of a primordial ocean of liquid water.
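
The threshold and the example file described above could be handled roughly as in the following Python sketch. It assumes the counts are already available as a mapping from word group to frequency; the helper names (add_example, frequent_terms, format_examples) are hypothetical, and the output layout simply mirrors the two sample entries.

```python
from collections import defaultdict

MIN_COUNT = 10      # keep groups seen at least this many times
MAX_EXAMPLES = 10   # store at most this many sentences per group


def add_example(examples, term, page, sentence, limit=MAX_EXAMPLES):
    """Remember (page, sentence) for term until the limit is reached."""
    bucket = examples[term]
    if len(bucket) < limit and (page, sentence) not in bucket:
        bucket.append((page, sentence))


def frequent_terms(counts, min_count=MIN_COUNT):
    """Return (term, count) pairs at or above min_count, most frequent first."""
    kept = [(term, c) for term, c in counts.items() if c >= min_count]
    return sorted(kept, key=lambda tc: (-tc[1], tc[0]))


def format_examples(term, examples):
    """Render the stored sentences for term in the '[page: ...]' layout."""
    lines = []
    for page, sentence in examples[term]:
        lines += [f"[page: {page}]", "", sentence, ""]
    return "\n".join(lines)


if __name__ == "__main__":
    examples = defaultdict(list)
    add_example(examples, "hypothesized", "Prion",
                "Prions are hypothesized to infect and propagate.")
    print(format_examples("hypothesized", examples))
```

Ties in frequency are broken alphabetically here, which is an arbitrary choice made only to keep the ranking deterministic.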


I'm flexible about the exact format of the data, and you can skip groups starting and ending with common stop words (a, the, etc.).
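
One possible reading of the stop-word rule is sketched below in Python. It assumes that a group is skipped when its first or its last word is a stop word (so a lone stop word is skipped too), and it uses a tiny illustrative stop list; both the interpretation and the names STOP_WORDS and keep_group are assumptions, not part of the brief.

```python
# Illustrative stop list only; a real run would use a fuller one.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "and", "or", "in", "on", "to", "is"}
)


def keep_group(group, stop_words=STOP_WORDS):
    """Return False if the group starts or ends with a stop word."""
    words = group.split()
    if not words:
        return False
    # The comparison is lowercased, so "The" is treated like "the".
    return (words[0].lower() not in stop_words
            and words[-1].lower() not in stop_words)


if __name__ == "__main__":
    for g in ["primordial ocean", "the site", "site of",
              "site of a primordial ocean"]:
        print(g, "->", keep_group(g))
```

Stop words inside a group are allowed, so "site of a primordial ocean" is kept while "the site" and "site of" are dropped.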

The main objective is the result, so you can write the program in any language you like. You'll need to download the Wikipedia database from [url removed, login to view]; the project is very straightforward, but the database is quite large.
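
Because the database is large, it probably has to be read as a stream rather than loaded whole. The Python sketch below assumes the download is a bzip2-compressed MediaWiki XML export in which each page element carries title, ns and text children, with ns equal to 0 marking main entry pages; that layout is an assumption to check against the actual file. The function names and the path "dump.xml.bz2" are placeholders.

```python
import bz2
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source):
    """Yield (title, wikitext) for main-namespace pages from a binary stream."""
    title = ns = text = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            # Assumption: ns "0" marks the main entry pages.
            if ns == "0" and title is not None:
                yield title, text or ""
            title = ns = text = None
            elem.clear()  # release the page's subtree so memory stays low


def iter_dump(path):
    """Stream articles from a bzip2-compressed dump without unpacking it."""
    with bz2.open(path, "rb") as stream:
        yield from iter_articles(stream)


if __name__ == "__main__":
    # "dump.xml.bz2" is a placeholder path, not the real file name.
    for page_title, wikitext in iter_dump("dump.xml.bz2"):
        print(page_title, len(wikitext))
```

The text yielded here is still raw wiki markup; stripping templates, links and tags before counting words is a separate step that this sketch does not attempt.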

Skills: C Programming, PHP, Python, XML

About the employer:
(76 reviews) Brighton, United Kingdom

Project ID: #387392