MCSc Thesis Defence - Fast Calculation of N-Gram-Based Phrase Similarity

When: December 4, 2017 @ 10:00 am


Who: Zichu Ai

Title: Fast Calculation of N-Gram-Based Phrase Similarity

Examining Committee:

Norbert Zeh - Faculty of Computer Science (Supervisor)
Abidalrahman Mohammad - Faculty of Computer Science (Co-Supervisor)
Vlado Keselj - Faculty of Computer Science (Reader)
Evangelos Milios - Faculty of Computer Science (Reader)

Chair: Raghav Sampangi - Faculty of Computer Science


Text Relatedness using word and phrase relatedness method (TrWP) is a text relatedness measure that computes semantic similarity between words and phrases using aggregated statistics from the Google Web 1T corpus. The phrase similarity computation in TrWP incurs significant time and memory overhead, which makes TrWP impractical for real-world use. This thesis presents an in-memory computational framework for TrWP that optimizes the calculation through efficient indexing and compact storage, using perfect hashing, parallelism, quantization and variable-length encoding. Using the Google Web 1T 5-gram corpus, we demonstrate that our framework is 5 to 6 orders of magnitude faster than the file-based TrWP framework.



Where: Room 311, Goldberg Computer Science Building