In progress

Simple data mining script / Perl regex

Dear all,

I need somebody to write a simple parser in Perl. The task is quite straightforward.

I have a large data set with the following structure:

id_1 | paragraph_1

id_2 | paragraph_2

id_3 | paragraph_3

etc.

where id_n is a running id and paragraph_n is a variable containing text data.
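As a rough illustration of the input format described above, the records could be read as follows. This is only a sketch in Python (the job itself asks for Perl); the function name `read_records` is hypothetical, and it assumes one record per line with the first pipe acting as the delimiter.

```python
from typing import Iterable, Iterator, Tuple


def read_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (id, paragraph) pairs from lines shaped like 'id | paragraph'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        # Split on the first pipe only, in case the paragraph contains '|'.
        record_id, sep, paragraph = line.partition("|")
        if not sep:
            continue  # no delimiter found: not a record line
        yield record_id.strip(), paragraph.strip()
```

With a real file this would be called as `read_records(open(path, encoding="utf-8"))`, where `path` is whatever the data set is named.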

paragraph_n:

In FY 2005 que posuere aucibulum justo. Classa. Maecenas quam sociosque nunc ultrices. Nunc ipsum accumsan vive. Vitae mollis ut tor ristique mauristibus feugiat in 1991. As of 2008 Intesque augue nunc, rutrum, urnar. Donec vel, orna et mollicies wisis fermentum aliquam. Nulla fafring. In March 2006 Nullam leo for 2008 trisuscetuer condisse vitae mollis ut tor ristique mauristibus feugiat in nissim a. Donec vel, orna et mollicies wisis fermentum 8769. mollicies wisis fermentum aliquam in FY 2007. Nullam leo trisuscetuer condisse vitae mollis ut tor before 2006. During 2007 Nullam leo trisuscetuer condisse vitae mollis ut tor. From 2007 through 2010 condisse vitae mollis ut.

This paragraph contains 9 time references in total:

In FY 2005

in 1991

As of 2008

In March 2006

for 2008

in FY 2007

before 2006

During 2007

From 2007 through 2010

The script is required to perform two different tasks:

1.) The script should extract all time references. Time references are of two kinds: some are located at the beginning of a sentence, for example "In FY 2005", whereas others are inside a sentence, e.g. "in 1991". For each paragraph, the script should extract all time references and write them to a pipe-separated output file together with the corresponding id, that is:

id | time_ref_1 | time_ref_2 | time_ref_3 | time_ref_4 | ..... | time_ref_n

Please note that each paragraph is likely to have a different number of time references (maximum is around 20). I already have a list of regular expressions identifying time references which I will supply. However, the list is incomplete, and one task of the coder would be to find more regular expressions which identify time/years/months.
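A minimal sketch of task 1, written in Python for illustration; the regex constructs used (`\b`, `\s`, non-capturing groups, case-insensitive and extended flags) also exist in Perl, so the pattern should port with little change. The pattern below is an assumption derived only from the nine sample references, not the list of regular expressions mentioned above, and the names `TIME_REF`, `extract_time_refs` and `format_row` are hypothetical.

```python
import re

# Assumption: years of interest fall in 1900-2099.
YEAR = r"(?:19|20)\d{2}"
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")

# Ranges ("From 2007 through 2010") are tried first so they are not
# cut short; then a preposition, an optional "FY" or month, and a year.
TIME_REF = re.compile(
    rf"""
    \b(?:
        from\s+{YEAR}\s+(?:through|to|until)\s+{YEAR}
      | (?:in|as\s+of|for|before|after|during|since|until|by)\s+
        (?:FY\s+|{MONTH}\s+)?{YEAR}
    )\b
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_time_refs(paragraph):
    """Return all time references in order of appearance."""
    return [m.group(0) for m in TIME_REF.finditer(paragraph)]


def format_row(record_id, refs):
    """Build one output line: id | time_ref_1 | ... | time_ref_n."""
    return " | ".join([record_id, *refs])
```

Because the preposition list and the year range are guesses, bare numbers such as "8769" are ignored, but so would be any phrasing not covered above (for example quarters or day-level dates); those would need extra alternatives.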

2.) The script should also split each paragraph into subparagraphs. Whenever a sentence in a paragraph starts with a capitalized time reference (examples: "In FY", "In March 2003", "As of", etc.), this sentence and all following sentences up to the next capitalized time reference should be parsed into a new sub_par_n variable and written to a pipe-separated output file. For the example at hand:

id | sub_par_1 | 1

id | sub_par_2 | 2

id | sub_par_3 | 3

id | sub_par_4 | 4

id | sub_par_5 | 5

where

sub_par_1:

In FY 2005 que posuere aucibulum justo. Classa. Maecenas quam sociosque nunc ultrices. Nunc ipsum accumsan vive. Vitae mollis ut tor ristique mauristibus feugiat in 1991.

sub_par_2:

As of 2008 Intesque augue nunc, rutrum, urnar. Donec vel, orna et mollicies wisis fermentum aliquam. Nulla fafring.

sub_par_3:

In March 2006 Nullam leo for 2008 trisuscetuer condisse vitae mollis ut tor ristique mauristibus feugiat in nissim a. Donec vel, orna et mollicies wisis fermentum 8769. mollicies wisis fermentum aliquam in FY 2007. Nullam leo trisuscetuer condisse vitae mollis ut tor before 2006.

sub_par_4:

During 2007 Nullam leo trisuscetuer condisse vitae mollis ut tor.

sub_par_5:

From 2007 through 2010 condisse vitae mollis ut.

Whenever a paragraph yields only a single subparagraph, sub_par equals the whole paragraph.
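The splitting rule of task 2 could be sketched like this, again in Python for illustration only. It assumes sentences end in ".", "!" or "?" followed by whitespace, and that a sentence-initial time reference is one of a guessed set of capitalized prepositions followed by an optional "FY" or month and a four-digit year; `split_subparagraphs` and `format_rows` are hypothetical names.

```python
import re

# Assumption: years of interest fall in 1900-2099.
YEAR = r"(?:19|20)\d{2}"
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")

# Case-sensitive on purpose: only capitalized, sentence-initial
# references open a new subparagraph.
STARTS_WITH_TIME_REF = re.compile(
    r"(?:In|As\s+of|For|Before|After|During|Since|Until|By|From)\s+"
    rf"(?:FY\s+|{MONTH}\s+)?{YEAR}\b"
)
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def split_subparagraphs(paragraph):
    """Group sentences; a sentence-initial time reference opens a new group."""
    subparagraphs = []
    for sentence in SENTENCE_BREAK.split(paragraph.strip()):
        if not sentence:
            continue
        if subparagraphs and not STARTS_WITH_TIME_REF.match(sentence):
            # Continuation: attach to the current subparagraph.
            subparagraphs[-1] += " " + sentence
        else:
            subparagraphs.append(sentence)
    return subparagraphs


def format_rows(record_id, paragraph):
    """Build output lines of the form: id | sub_par_n | n."""
    subs = split_subparagraphs(paragraph)
    return [f"{record_id} | {sub} | {n}" for n, sub in enumerate(subs, start=1)]
```

A paragraph without any sentence-initial time reference comes back as a single subparagraph, matching the rule above. Note that sentences are rejoined with single spaces, so irregular whitespace in the input is normalized, and abbreviations containing a period would be treated as sentence ends by this simple splitter.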

If you should require further information, please feel free to private message me.

Best regards

Philipp

Skills: Data Entry, Data Processing, Perl


About the employer:
(1 review) London, United Kingdom

Project ID: #428688

Awarded to:

edatawiz

Hi - Please check your PM for details.

$50 USD in 1 day
(12 reviews)
4.2

9 freelancers bid an average of $39 for this job

sajib6

we are ready to start it.

$30 USD in 1 day
(41 reviews)
6.5
hiratariq

Hi! Please see PM.

$50 USD in 2 days
(7 reviews)
3.9
shwetakagliwal

I can do this quite [url removed, login to view] me a chance. Thanks.

$30 USD in 0 days
(2 reviews)
1.8
waynedavis1985

Please send the complete details. thnx

$40 USD in 1 day
(0 reviews)
0.0
AkhilaUppada

I can do this please forward me further details

$30 USD in 1 day
(0 reviews)
0.0
shafi7136

hi, pls see pm. tks.

$30 USD in 2 days
(0 reviews)
0.0
senmal2003

I'm already working with Perl programming; I'm expecting your response.

$50 USD in 2 days
(0 reviews)
0.0
swapy345

pls check PM

$40 USD in 1 day
(0 reviews)
0.0