November 27, 2013

tf-idf, Term Frequency–Inverse Document Frequency

tf-idf stands for Term Frequency-Inverse Document Frequency. It is an effective algorithm for extracting keywords from a given document, and it is often used in natural language processing (NLP) and information retrieval (IR). The extraction is statistical: it calculates the weight of a word in a document relative to a document list, or corpus. Many search engines use tf-idf to extract keywords, frequently in combination with other algorithms. A word's weight, or importance, is measured by its frequency in a document, offset by how common the same word is across the given corpus.

tf-idf = term frequency x inverse document frequency

For example, if the word 'peace' appears 6 times in a document of 100 words, the tf is 6/100 = 0.06. If the corpus, or document list, contains 1000 documents and 'peace' appears in 200 of them, the idf is 1000/200 = 5. Hence the tf-idf for the word is 0.06 x 5 = 0.3. The weight, or tf-idf, is directly related to the importance of the word: the higher the tf-idf, the more important the word is to that document.
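The arithmetic in the example can be reproduced with a few lines of Python. This is only a minimal sketch of the calculation as described here, not the code from the module mentioned in this post. The function names and the toy corpus are made up for illustration, and each document is assumed to be already split into a list of words.

```python
from collections import Counter


def term_frequency(term, document_words):
    """Fraction of the document's words that are `term`."""
    if not document_words:
        return 0.0
    return Counter(document_words)[term] / len(document_words)


def inverse_document_frequency(term, corpus):
    """Number of documents divided by the number that contain `term`."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        # Assumption: a term that appears nowhere gets a weight of 0.
        return 0.0
    return len(corpus) / containing


def tf_idf(term, document_words, corpus):
    """tf-idf = term frequency x inverse document frequency."""
    return term_frequency(term, document_words) * inverse_document_frequency(term, corpus)


# Toy data matching the example: a 100-word document with 'peace' 6 times,
# in a corpus of 1000 documents where 200 contain 'peace'.
document = ["peace"] * 6 + ["filler"] * 94
corpus = [document] + [["peace", "x"]] * 199 + [["x"]] * 800

score = tf_idf("peace", document, corpus)
print(round(score, 2))  # prints 0.3
```

Many implementations take the logarithm of the idf ratio so that very rare words do not dominate the score. The sketch keeps the plain ratio so that it matches the numbers above.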



I've written a Python module to extract the keywords from a given corpus. This is useful if you want to extract keywords from a set of website links and categorize the links according to those keywords. You can use the code freely by downloading it from the following GitHub location.

https://github.com/ludmal/pylib

A complete usage sample can be found here: https://github.com/ludmal/pylib/blob/master/sample.py