November 27, 2013

tf-idf, Term Frequency–Inverse Document Frequency

tf-idf stands for Term Frequency–Inverse Document Frequency, and it is an effective algorithm for extracting keywords from a given document. It is often used in natural language processing (NLP) and information retrieval (IR). It is a statistical measure that assigns a weight to each word in a document relative to a document list, or corpus. Many search engines use tf-idf to extract keywords, frequently in combination with other algorithms. A word's weight, or importance, increases with its frequency in a document and is offset by how often the same word occurs across the given corpus.

tf-idf = term frequency x inverse document frequency

For example, if the word 'peace' appears 6 times in a document of 100 words, the tf is 6/100 = 0.06. If the corpus or document list contains 1000 documents and 'peace' appears in 200 of them, the idf in its simplest form is 1000/200 = 5. Hence the tf-idf for the word is 0.06 x 5 = 0.3. (Most formulations take the logarithm of this ratio, so the idf would be log(1000/200) ≈ 1.61 using the natural log, giving a tf-idf of about 0.097.) The tf-idf weight is directly related to the importance of the word: the higher the tf-idf, the more important the word is to that document.
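The arithmetic for the 'peace' example can be checked with a few lines of Python. This is only a minimal sketch: the function names and the use_log switch are made up for illustration, and the natural log is one common choice among several.

```python
import math

def term_frequency(term_count, total_terms):
    # Share of the document's words that are this term.
    return term_count / total_terms

def inverse_document_frequency(num_docs, docs_with_term, use_log=False):
    # Raw ratio as in the example; use_log=True gives the more common log-scaled form.
    ratio = num_docs / docs_with_term
    return math.log(ratio) if use_log else ratio

tf = term_frequency(6, 100)                                     # 0.06
idf_raw = inverse_document_frequency(1000, 200)                 # 5.0
idf_log = inverse_document_frequency(1000, 200, use_log=True)   # ln(5)

print(round(tf * idf_raw, 4))  # raw-ratio tf-idf
print(round(tf * idf_log, 4))  # log-scaled tf-idf
```

The log-scaled variant grows more slowly as a word gets rarer, so very rare words do not dominate the ranking.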

I've written a Python module to extract keywords from a given corpus. This is useful if you want to extract keywords from a given set of website links and categorize the links according to those keywords. You can use the code freely by downloading it from the following GitHub location.
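As a rough illustration of keyword extraction over a corpus, here is a small standalone sketch. It is not the module's actual code: the tokenize and top_keywords helpers, the toy corpus, and the log-scaled idf are all assumptions made for this example.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, split on whitespace, trim punctuation.
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def top_keywords(corpus, doc_index, k=3):
    # Rank the terms of one document by tf-idf against the whole corpus.
    docs = [tokenize(d) for d in corpus]
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    tokens = docs[doc_index]
    counts = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * math.log(len(docs) / doc_freq[term])
        for term, count in counts.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

corpus = [
    "The treaty brought peace, the peace held.",
    "The market fell as the traders sold.",
    "The team won the match.",
]
print(top_keywords(corpus, 0, k=2))
```

In this toy corpus 'the' appears in every document, so its idf is log(3/3) = 0 and it drops to the bottom of the ranking, while 'peace' comes out on top for the first document.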

A complete usage sample can be found here: