The Integration of a Newly Defined N-gram Concept and Vector Space Model for Documents Ranking

The vector space model (VSM) is commonly used in relevancy ranking and document categorization. Current integrations of the N-gram concept in VSM use unigrams, bigrams, and trigrams as single terms in the TF-IDF weighting. This N-gram concept does not capture the contextual and semantic dependency between nonconsecutive words. This study proposes an approach that treats a document as a bag of sentences and each sentence as a bag of words. Accordingly, a new definition of the word N-gram is presented: N nonconsecutive words located in the same sentence. The approach then integrates this newly defined N-gram concept into VSM to measure the similarity between a pair of documents. This lets the similarity measure account for the relevancy between words, and allows visualization of the relevant words that a pair of documents has in common. The competency of the approach is verified by ranking the Corporate Social Responsibility (CSR) reports of a set of multinational firms according to their relevancy to the Global Reporting Initiative (GRI) G4 standards. The ranking results are compared to those of ethical rating organizations such as Futurescape and CSRHub. Furthermore, the approach is used to classify a set of documents from a benchmark Reuters data set into other general categories. The results show the robustness and high accuracy of the proposed approach compared with the traditional VSM.
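
To make the redefined N-gram concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes naive sentence splitting on ".", "!" and "?", treats every unordered combination of N distinct words in a sentence as one term (whether or not the words are adjacent), and uses raw term counts with cosine similarity in place of full TF-IDF weighting. The function names `sentence_ngrams`, `cosine`, and `shared_terms` are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def sentence_ngrams(text, n=2):
    """Count unordered n-word combinations that co-occur in one sentence.

    Assumption: sentences end at '.', '!' or '?', and words are runs of
    ASCII letters. Each sentence is treated as a bag (set) of words.
    """
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = sorted(set(re.findall(r"[a-z]+", sentence)))
        counts.update(combinations(words, n))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shared_terms(a, b):
    """Word combinations present in both documents, for inspection."""
    return sorted(a.keys() & b.keys())


if __name__ == "__main__":
    doc_a = "The firm reports carbon emissions. Workers receive fair wages."
    doc_b = "Emissions of carbon are reported by the firm."
    vec_a, vec_b = sentence_ngrams(doc_a), sentence_ngrams(doc_b)
    print(cosine(vec_a, vec_b))
    print(shared_terms(vec_a, vec_b))
```

Under these assumptions, two documents score above zero only when they share words that appear together in a sentence, so documents that reuse the same vocabulary in unrelated sentences are not counted as similar. The shared combinations returned by `shared_terms` correspond to the common relevant words the abstract says can be visualized.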