Hybrid Information Retrieval with Masked and Permuted Language Modeling (MPNet) and BM25L for Indonesian Drug Data Retrieval

Maryamah Maryamah, Geraldus Wilsen, Christeigen Theodore Suhalim, Rafik Septiana, Aziz Fajar, Mahmud Iwan Solihin

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Citation (Scopus)

Abstract

Lexical or statistical information retrieval faces challenges such as the semantic gap and vocabulary mismatch. In medical data, these difficulties are compounded by users' diverse backgrounds, which lead to differences in perspective and vocabulary. The intricacies of medical language, including spelling variations, frequent acronyms, and ambiguous concepts, widen the semantic gap in medical texts further. A semantic approach can address these issues, but its soft matching introduces a new challenge: lower recall. In response, we propose a hybrid information retrieval method that combines semantic and lexical approaches. Whereas recent experiments have emphasized semantic methods as re-rankers, our approach incorporates semantic techniques as a fundamental part of the retrieval model. This shift aims to explore the efficacy of semantic methods in the initial retrieval stage rather than relying on them exclusively for post-retrieval refinement. The results of the semantic and lexical retrieval approaches are then re-ranked with Reciprocal Rank Fusion. The proposed method outperforms lexical methods such as BM25L, Jaccard Similarity, and the Query Likelihood Model, as well as semantic methods including doc2vec, multilingual BERT, IndoBERT, and MiniLM. It has also been shown to be more effective than other hybrid models, PLM-based dense retrieval. By capitalizing on the strengths of both semantic and lexical methods, the technique improves overall performance in retrieving relevant documents.
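
The fusion step named in the abstract can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the authors' implementation: the constant k = 60 is the value commonly used for this technique and is an assumption here (the abstract does not state it), and the document IDs and the two input rankings are hypothetical. Each document receives the sum of 1 / (k + rank) over every ranking in which it appears, and the fused list is sorted by that score.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    rankings: iterable of lists, each ordered from most to least relevant.
    k: smoothing constant (assumed value; not taken from the paper).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a lexical retriever and a semantic retriever.
lexical = ["d1", "d2", "d3"]
semantic = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

Documents that appear near the top of both lists (here `d1` and `d3`) rise above documents found by only one retriever, which is how the fusion can draw on both lexical and semantic evidence without comparing their raw scores directly.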

Original language: English
Title of host publication: KST 2024 - 16th International Conference on Knowledge and Smart Technology
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 242-247
Number of pages: 6
ISBN (Electronic): 9798350370737
Publication status: Published - 2024
Event: 16th International Conference on Knowledge and Smart Technology, KST 2024 - Krabi, Thailand
Duration: 28 Feb 2024 to 2 Mar 2024

Publication series

Name: KST 2024 - 16th International Conference on Knowledge and Smart Technology

Conference

Conference: 16th International Conference on Knowledge and Smart Technology, KST 2024
Country/Territory: Thailand
City: Krabi
Period: 28/02/24 to 2/03/24

Keywords

  • BERT
  • Drug Data Retrieval
  • Hybrid Information Retrieval
  • Technology
  • Word Embeddings
