Words in Wikipedia

Using the latest full English Wikipedia database dump (freely available to download), generate a frequency-ranked, case-sensitive list of the words used in the main article pages. The list should include single words and groups of up to four words (hyphen- or space-separated), cover plain text only (no wiki markup), and be taken from the middle of sentences (never the first word of a sentence, so that every entry keeps its correct capitalization).

Provide a list of all words and word groups that appear at least 10 times, and provide a file containing ten complete sentences in which each word appears, together with the name of the wiki page on which each sentence appears, e.g.


[page: Prion]

Prions are hypothesized to infect and propagate by refolding abnormally into a structure which is able to convert normal molecules of the protein into the abnormally structured form.

[page: Mars_Ocean_Hypothesis]

The blue region of low topography in the Martian northern hemisphere is hypothesized to be the site of a primordial ocean of liquid water.
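One way to collect and write out example sentences in the layout shown above is sketched below in Python. The helper names record_example and format_examples are hypothetical, and the choice to keep the first ten distinct sentences encountered (rather than a random sample) is an assumption.

```python
from collections import defaultdict


def record_example(examples, group, page, sentence, limit=10):
    """Remember up to `limit` distinct (page, sentence) pairs per word group."""
    bucket = examples[group]
    if len(bucket) < limit and (page, sentence) not in bucket:
        bucket.append((page, sentence))


def format_examples(entries):
    """Render entries as '[page: Name]', a blank line, then the sentence."""
    return "\n\n".join(
        f"[page: {page}]\n\n{sentence}" for page, sentence in entries
    )


# Usage with one of the sentences from the brief.
examples = defaultdict(list)
record_example(
    examples,
    "hypothesized",
    "Mars_Ocean_Hypothesis",
    "The blue region of low topography in the Martian northern hemisphere "
    "is hypothesized to be the site of a primordial ocean of liquid water.",
)
print(format_examples(examples["hypothesized"]))
```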


I'm flexible about the exact format in which the data is provided, and you can skip groups starting and ending with common stop words (a, the, etc.).
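The 10-occurrence threshold and the stop-word rule could be applied as in the Python sketch below. Several things here are assumptions rather than part of the brief: the small STOP_WORDS set is illustrative only, the rule is read as "drop a multi-word group if its first or last word is a stop word", and single words are always kept. The names keep_group and ranked are hypothetical.

```python
import re

# Illustrative stop-word list only; a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to"}


def keep_group(group, count, min_count=10, stop_words=STOP_WORDS):
    """Return True if the group is frequent enough and not stop-word bounded."""
    if count < min_count:
        return False
    words = re.split(r"[ -]", group)
    if len(words) == 1:
        return True  # assumption: the stop-word rule applies to groups only
    return (words[0].lower() not in stop_words
            and words[-1].lower() not in stop_words)


def ranked(counts, min_count=10, stop_words=STOP_WORDS):
    """Return (group, count) pairs, most frequent first, ties alphabetical."""
    kept = [(g, c) for g, c in counts.items()
            if keep_group(g, c, min_count, stop_words)]
    return sorted(kept, key=lambda gc: (-gc[1], gc[0]))
```

If the stricter reading is intended (skip only groups that both start and end with a stop word), the final `and` in keep_group becomes `or`.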

Skills: Data Processing

About the employer:
(76 reviews) Brighton, United Kingdom

Project ID: #386597