I am looking for a Python developer, preferably an expert in NLP, to help me finish a search engine for one of my college courses. The first part of the code, which is an inverted index, is already done.
Please DO NOT change any parts of the pre-existing code, except where instructed. It is important to keep the posting lists as they are - DO NOT shorten them.
As I only have a limited number of characters, I have attached a file that contains a more detailed job description with examples, as well as a screenshot of what the result should look like.
Please read the instructions carefully first and have a look at the screenshot before bidding. It is of great importance to follow the instructions (e.g. NOT using libraries for certain parts).
This task should not be too much trouble for a skilled developer.
Here is the rough outline of what needs to be done:
- the tokens need to be stemmed, using snowballstemmer for German. It MUST be done in a separate function; do not stem in the same function where the tokens are counted. I have noted in the code where to add this part.
Stemming must also be applied to the queries. So, for example, if you type "eating" into a query (both inverted index AND cosine similarity), anything starting with "eat" should be printed out.
- tf-idf needs to be calculated. MOST IMPORTANTLY: you CANNOT use any libraries for this. So DO NOT use sklearn, TfidfVectorizer or anything like that.
Each part (tf, idf, tf-idf) needs to be calculated in a separate function. I have noted where to add these in the code as well.
If you use a library like TfidfVectorizer, or anything else that does the same job, I cannot accept the code.
- cosine similarity has to be calculated; this also MUST be done in a function, with NO libraries (no sklearn, etc.).
It has to be calculated from whatever is typed into a query, compared against the texts in the corpus.
This query has to be reached from the main function by typing "2" in the menu (menu already implemented; please find the corresponding part in the main function to add the query).
The user should be able to search for words and then see the Top N (e.g. Top 10) ranked documents, with the document name AND document ID as well as the cosine similarity, tf, idf, and final tf-idf for each result (please view the screenshot for this).
After choosing the tf-idf option in the menu (menu already implemented; tf-idf is chosen by entering "2"), the overall top 10 results (or any other number) for tf-idf should be printed first, without a query (no cosine similarity here, as it is used for queries only).
It should look something like this:
Documents: [id: name (|d|)]
0: text1, 1: text2, 2: text3,....
dictionary: [term: idf | (doc: tf), (doc: tf), (doc: tf),...]
After that, the program should ask the user to type in a query. The result should look something like this (using cosine similarity):
Top 3 containing the queried word(s):
filename1 (file ID, tf | idf)
filename2 (file ID, tf | idf)
filename3 (file ID, tf | idf)
(please view the screenshot for details, you will understand what I mean)
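As a rough sketch of the "separate functions for tf, idf and tf-idf, no libraries" requirement above (only the standard `math` module is used): the exact formulas are not fixed in this description, so the length-normalised tf and the `log10(N / df)` idf below are assumptions. The index shape `{term: {doc_id: count}}` and the names `compute_tf`, `compute_idf`, `compute_tfidf` and `build_tfidf_table` are hypothetical and would have to be adapted to the existing posting lists.

```python
import math


def compute_tf(term_count, doc_length):
    """Term frequency: raw count divided by document length (assumed formula)."""
    if doc_length == 0:
        return 0.0
    return term_count / doc_length


def compute_idf(num_docs, doc_freq):
    """Inverse document frequency: log10(N / df) (assumed formula)."""
    if doc_freq == 0:
        return 0.0
    return math.log10(num_docs / doc_freq)


def compute_tfidf(tf, idf):
    """Final weight: product of tf and idf."""
    return tf * idf


def build_tfidf_table(index, doc_lengths):
    """Build {doc_id: {term: tf-idf}} from a hypothetical inverted index.

    index       -- {term: {doc_id: raw count}} (assumed shape)
    doc_lengths -- {doc_id: number of tokens in the document}
    """
    num_docs = len(doc_lengths)
    table = {}
    for term, postings in index.items():
        idf = compute_idf(num_docs, len(postings))
        for doc_id, count in postings.items():
            tf = compute_tf(count, doc_lengths[doc_id])
            table.setdefault(doc_id, {})[term] = compute_tfidf(tf, idf)
    return table


# Tiny made-up corpus, for illustration only.
example_index = {"vater": {0: 2, 1: 1}, "sohn": {0: 2}, "haus": {1: 1, 2: 3}}
example_lengths = {0: 4, 1: 2, 2: 3}
example_table = build_tfidf_table(example_index, example_lengths)
```

Each posting list is walked exactly once and nothing is removed from it, which should stay fast enough for roughly 4000 texts.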
The user should be able to type in more than one word, but the texts don't have to contain every single one of the words typed in to appear in the results.
The attached screenshot, a commented screenshot, and the more detailed project description will give you more details. Please consult these if you need more information. I have also provided some of the texts I am working with.
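A minimal sketch of how a multi-word query could be ranked by cosine similarity without any libraries, assuming the query and every document are already available as sparse `{term: weight}` dictionaries (for example tf-idf weights). The names `cosine_similarity` and `rank_documents` are hypothetical. A document is kept as soon as it shares at least one term with the query, so it does not need to contain every word.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * vec_b[term] for term, weight in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(weight * weight for weight in vec_a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vectors, top_n=10):
    """Return up to top_n (doc_id, score) pairs, best match first.

    Documents with a score of 0 (no term in common with the query) are dropped.
    """
    scores = []
    for doc_id, doc_vec in doc_vectors.items():
        score = cosine_similarity(query_vec, doc_vec)
        if score > 0:
            scores.append((doc_id, score))
    # Sort by descending score; ties are broken by ascending document ID.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


# Made-up weights, for illustration only.
example_docs = {
    0: {"vater": 1.0, "sohn": 1.0},
    1: {"vater": 1.0, "haus": 1.0},
    2: {"haus": 2.0},
}
example_query = {"vater": 1.0, "sohn": 1.0}
example_ranking = rank_documents(example_query, example_docs, top_n=3)
```

With almost 4000 texts, it may be faster to score only the documents found in the posting lists of the query terms rather than looping over the whole corpus.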
Please note that the code has to be as simple as possible, nothing too hard/fancy. And it should be quite fast as I have to go through almost 4000 texts.
To test the query with the texts I provided, I recommend searching for "vater sohn" and checking whether cosine similarity works.
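For a query such as "vater sohn" to match the index, it has to go through the same stemming step as the corpus tokens. The sketch below keeps stemming in its own function, separate from counting, and reuses it for queries. It assumes the `snowballstemmer` package is installed; the helper names `make_german_stemmer`, `stem_tokens` and `prepare_query`, as well as the lowercase-and-split tokenisation, are hypothetical and should follow whatever the existing code already does.

```python
def make_german_stemmer():
    """Create a German Snowball stemmer (needs `pip install snowballstemmer`)."""
    import snowballstemmer  # imported here so the other helpers do not require it
    return snowballstemmer.stemmer("german")


def stem_tokens(tokens, stemmer):
    """Return the stem of every token; counting is done elsewhere."""
    return stemmer.stemWords(tokens)


def prepare_query(raw_query, stemmer):
    """Lowercase and split a query (assumed tokenisation), then stem it."""
    tokens = raw_query.lower().split()
    return stem_tokens(tokens, stemmer)


# Usage, assuming snowballstemmer is installed:
# stemmer = make_german_stemmer()
# query_terms = prepare_query("vater sohn", stemmer)
```

Creating the stemmer once and passing it in avoids rebuilding it for each of the almost 4000 texts.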