Text mining in Python to identify interesting document files

$30-250 CAD

Completed
Posted over 11 years ago

Paid on delivery
I want to run a simple text mining task in Python on a large number of files. The files are stored on a few large network shares and total about a million files of roughly 100 different file types. The "text and document" file types considered especially interesting are Microsoft Office files, PDF and text (doc, docx, xls, xlsx, xlsm, ppt, pptx, pdf, txt, html, xml, etc.). There are also some binary files, movies and other files that are considered less interesting.

I have a list of about 100 interesting words in a text file (txt), one word per row. I want to identify all files that contain one or more of these words in their file name, path or file contents.

I would like this task solved in Python. I am not an experienced Python programmer, so the code should be well written, well annotated and easy to modify. Communication and code should be in English. The code should preferably work on Windows, Mac OS X and Linux.

I would like:

1) A script to list all the files (not folders) in a network share. Number the list, one file per row. List interesting file information, with columns separated by semicolons (;). Something like this:

File Counter; Full path; creation date; modification date; file owner; filetype; etc.
1; C:/mypath/[login to view URL]; date; date; owner; text document
2; C:/mypath/[login to view URL]; date; date; owner; Microsoft Word

2) One or several scripts to scan the names and contents of the files from script 1), based on the file of interesting words. "Document file types and text files" (see above) shall be scanned for both the content and the full path (file name + path). Other files don't have to be scanned for content but need to be scanned for the full path. The script(s) shall report the action on each file and the result (name scan: yes/no/error, content scan: yes/no/error). If an error occurs with a file (e.g. when reading or parsing), this needs to be stated in the result of the file scan, but it should not interrupt the scan. The number of matches in the content, and which words matched, should be stated. The number of unique matched words from the word list (i.e. between 0 and about 100) should also be stated for each file.

The final output should be a list similar to the one in 1) but with additional columns, e.g.:

Path Scan; Content Scan; #Matches; Words matched; #Unique matches
(Yes/No/Error); (Yes/No/Error); Integer; Monkey, Bananas; 0-100

Provide me with a way to run this search as easily and quickly as possible. There are lots of files and speed is important. I don't want to wait two weeks for the search, and I don't want to find out that the scripts hit an error after three hours and stopped.

Bidders with recommendations, high ratings and a strong background in Python and text mining are preferred.
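
For illustration, the listing and scanning steps described above could be combined into a single pass roughly as follows. This is only a minimal sketch under stated assumptions, not the delivered solution: the names `load_pattern`, `scan_file`, `scan_tree` and `stamp` are hypothetical, only plain-text formats (txt, html, xml) are read for content, the "Skipped" label for unread files is this sketch's own, and the file owner column is left out because looking it up is platform-specific. Office and PDF files would need third-party parsers.

```python
import csv
import os
import re
from collections import Counter
from datetime import datetime

# Assumption: only plain-text formats are read here. Office and PDF files
# would need third-party parsers, which this sketch deliberately leaves out.
TEXT_EXTENSIONS = {".txt", ".html", ".xml"}
HEADER = ["File Counter", "Full path", "Creation date", "Modification date",
          "Filetype", "Path Scan", "Content Scan", "#Matches",
          "Words matched", "#Unique matches"]


def load_pattern(words_file):
    """Build one case-insensitive regex from a one-word-per-row text file."""
    with open(words_file, encoding="utf-8") as handle:
        words = [line.strip() for line in handle if line.strip()]
    if not words:
        raise ValueError("the word list is empty")
    # Longest first, so a longer word wins over a shorter word that is its prefix.
    words.sort(key=len, reverse=True)
    return re.compile("|".join(re.escape(word) for word in words), re.IGNORECASE)


def scan_file(path, pattern):
    """Return the five scan columns for one file; I/O errors become "Error"."""
    path_scan = "Yes" if pattern.search(path) else "No"
    if os.path.splitext(path)[1].lower() not in TEXT_EXTENSIONS:
        # "Skipped" is this sketch's own label for files whose content is not read.
        return [path_scan, "Skipped", 0, "", 0]
    counts = Counter()
    try:
        # Read line by line so large files are never loaded into memory at once.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                counts.update(hit.lower() for hit in pattern.findall(line))
    except OSError:
        return [path_scan, "Error", 0, "", 0]
    total = sum(counts.values())
    return [path_scan, "Yes" if total else "No", total,
            ", ".join(sorted(counts)), len(counts)]


def stamp(seconds):
    """Format a POSIX timestamp as a readable local date and time."""
    return datetime.fromtimestamp(seconds).strftime("%Y-%m-%d %H:%M:%S")


def scan_tree(root, words_file, report_file):
    """Walk root, write one semicolon-separated row per file, return the count."""
    pattern = load_pattern(words_file)
    counter = 0
    with open(report_file, "w", newline="", encoding="utf-8",
              errors="replace") as out:
        writer = csv.writer(out, delimiter=";")
        writer.writerow(HEADER)
        for folder, _subfolders, names in os.walk(root):
            for name in names:
                counter += 1
                path = os.path.join(folder, name)
                try:
                    info = os.stat(path)
                    # On Windows st_ctime has traditionally been the creation
                    # time; on Unix it is the last metadata change.
                    dates = [stamp(info.st_ctime), stamp(info.st_mtime)]
                except OSError:
                    dates = ["Error", "Error"]
                filetype = os.path.splitext(name)[1].lower()
                writer.writerow([counter, path, *dates, filetype,
                                 *scan_file(path, pattern)])
    return counter
```

A call such as `scan_tree("/mnt/share", "words.txt", "report.csv")` (all three paths are placeholders) would write the report in one run. Matching here is by substring, so a listed word also matches inside longer words, and the report file should live outside the scanned tree so it is not scanned itself. Because each file's errors are caught and recorded in its own row, one unreadable file does not stop the scan.
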
Project ID: 2473761

About the project

7 proposals
Remote project
Active 12 years ago

Awarded to:
I have significant experience with data extraction and machine learning, so creating a script like this should not be a problem for me. Documenting the code and communicating with you won't be an issue either, as I speak fluent English. Please check your PMB for some more info.
$220 CAD in 10 days
4.8 (20 reviews)
5.3
7 freelancers are bidding an average of $203 CAD on this project

I have experience in text mining, but I will have to learn some things about network interfacing. I will complete this project in the time allotted.
$250 CAD in 14 days
4.9 (8 reviews)
4.7

I have very good experience in data mining algorithms and Python.
$200 CAD in 10 days
5.0 (2 reviews)
3.2

Sounds like a hefty challenge. I'm up for that. Based in Toronto.
$250 CAD in 3 days
0.0 (1 review)
2.3

Custom software development (Removed by Admin)
$250 CAD in 1 day
0.0 (0 reviews)
0.0

Being a system administrator who deals in Python, this is right up my alley. We'll just have to clarify what your network setup is.
$200 CAD in 3 days
0.0 (0 reviews)
0.0

I have extensive knowledge of text mining and data mining, as I took courses on both while at school (Stanford University).
$50 CAD in 2 days
0.0 (0 reviews)
0.0

About the client

Tullinge, Sweden
5.0
1
Payment method verified
Joined Sep 9, 2012

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)