MCS Thesis Presentation - Fast Calculation of N-Gram-Based Phrase Similarity

Who: Zichu Ai

Title: Fast Calculation of N-Gram-Based Phrase Similarity

Examining Committee:

Norbert Zeh - Faculty of Computer Science (Supervisor)
Abidalrahman Mohammad - Faculty of Computer Science (Co-Supervisor)
Evangelos Milios (Reader)
Vlado Keselj (Reader)
Raghav Sampangi (Chair)



Text Relatedness using Word and Phrase relatedness (TrWP) is a text relatedness measure that computes semantic similarity between words and phrases using aggregated statistics from the Google Web 1T corpus. The phrase similarity computation in TrWP incurs significant time and memory overhead, making TrWP impractical for real-world use. This thesis presents an in-memory computational framework for TrWP that optimizes the calculation through efficient indexing and compact storage, using perfect hashing, parallelism, quantization, and variable-length encoding. Using the Google Web 1T 5-gram corpus, we demonstrate that our framework is 5 to 6 orders of magnitude faster than the file-based TrWP framework.



Where: Room 311, Goldberg Computer Science Building