November 27, 2013

tf-idf, Term Frequency–Inverse Document Frequency

tf-idf stands for Term Frequency - Inverse Document Frequency, and it is an effective algorithm for extracting keywords from a given document. It is often used in natural language processing (NLP) and information retrieval (IR). It is a statistical measure: it calculates the weight of a word in a document relative to a document list, or corpus. Many search engines use tf-idf to extract keywords, frequently in combination with other algorithms. A word's weight, or importance, increases with the word's frequency in a document and is offset by how common the same word is across the given corpus.

tf-idf = term frequency x inverse document frequency

For example, if the word 'peace' appears 6 times in a document of 100 words, the tf is 6/100 = 0.06. If the corpus, or document list, contains 1000 documents and 'peace' appears in 200 of them, then the idf, taken here as the plain ratio, is 1000/200 = 5. Hence the tf-idf for the word is 0.06 x 5 = 0.3. (Many implementations take the logarithm of this ratio so that very rare words do not dominate; the raw ratio is kept here for simplicity.) The tf-idf weight tracks the importance of the word, i.e. the higher the tf-idf, the more important the word is to that document.
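The arithmetic above can be sketched in a few lines of Python. This is only a minimal illustration and not the code from my module: the function names and the `use_log` switch are made up for this example, and documents are assumed to be plain lists of words.

```python
import math


def term_frequency(term, document_words):
    """Share of the words in the document that are `term`."""
    if not document_words:
        return 0.0
    return document_words.count(term) / len(document_words)


def inverse_document_frequency(term, corpus, use_log=False):
    """Corpus size divided by the number of documents containing `term`.

    With use_log=True the logarithm of that ratio is returned instead,
    which is the more common variant (an option added for illustration).
    """
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # the term never occurs, so give it no weight
    ratio = len(corpus) / containing
    return math.log(ratio) if use_log else ratio


def tf_idf(term, document_words, corpus, use_log=False):
    """tf-idf = term frequency x inverse document frequency."""
    return term_frequency(term, document_words) * inverse_document_frequency(
        term, corpus, use_log
    )


# Rebuild the example: 'peace' 6 times in a 100-word document,
# and present in 200 of the 1000 documents in the corpus.
document = ["peace"] * 6 + ["filler"] * 94
corpus = [["peace"]] * 200 + [["filler"]] * 800

print(round(tf_idf("peace", document, corpus), 2))  # 0.3
```

Passing `use_log=True` gives a smaller weight for the same word, since log(5) is about 1.61 rather than 5.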



I've written a Python module to extract the keywords from a given corpus. This is useful if you want to extract the keywords from a given set of website links and categorize the links according to those keywords. You can use the code freely by downloading it from the following GitHub location.

https://github.com/ludmal/pylib

A complete usage sample can be found here: https://github.com/ludmal/pylib/blob/master/sample.py



July 23, 2013

GIT Tortoise

I've recently moved all my projects to GitHub.com. This is mainly because GitHub has become more popular than Google Code. I was a big fan of TortoiseSVN, and luckily a Tortoise client is now available for Git as well. You can download it from the following location:
https://code.google.com/p/tortoisegit/

Also, if anyone wants to use Git with Visual Studio, here is how:

1. Install Git for Windows
https://code.google.com/p/msysgit/downloads/list?q=full+installer+official+git

2. Install Git Source Control Provider
http://gitscc.codeplex.com/

And you're done! Don't forget to select Git in the VS source control settings.