TY - GEN
T1 - Hybrid Information Retrieval with Masked and Permuted Language Modeling (MPNet) and BM25L for Indonesian Drug Data Retrieval
AU - Maryamah, Maryamah
AU - Wilsen, Geraldus
AU - Suhalim, Christeigen Theodore
AU - Septiana, Rafik
AU - Fajar, Aziz
AU - Solihin, Mahmud Iwan
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Lexical or statistical information retrieval faces challenges such as the semantic gap and vocabulary mismatch. In medical data, these difficulties are compounded by users' diverse backgrounds, which lead to differences in perspective and vocabulary. The intricacies of medical language, including spelling variations, frequent acronyms, and ambiguous concepts, further widen the semantic gap in medical texts. A semantic approach can address these issues, but it introduces a new challenge: its soft-matching nature leads to lower recall. In response, we propose a hybrid information retrieval method that combines semantic and lexical approaches. In contrast to recent work that uses semantic methods as a re-ranker, our approach incorporates semantic techniques as a fundamental part of the retrieval model, in order to explore their efficacy in the initial retrieval stage rather than relying on them exclusively for post-retrieval refinement. The results of the semantic and lexical retrieval approaches are then fused and reranked with Reciprocal Rank Fusion. The proposed method outperforms lexical methods such as BM25L, Jaccard Similarity, and the Query Likelihood Model, as well as semantic methods including doc2vec, multilingual BERT, IndoBERT, and MiniLM. It is also more effective than other hybrid models and PLM-based dense retrieval. By capitalizing on the strengths of both semantic and lexical methods, the technique improves overall performance in retrieving relevant documents.
KW - BERT
KW - Drug Data Retrieval
KW - Hybrid Information Retrieval
KW - Technology
KW - Word Embeddings
UR - http://www.scopus.com/inward/record.url?scp=85191655976&partnerID=8YFLogxK
U2 - 10.1109/KST61284.2024.10499674
DO - 10.1109/KST61284.2024.10499674
M3 - Conference contribution
AN - SCOPUS:85191655976
T3 - KST 2024 - 16th International Conference on Knowledge and Smart Technology
SP - 242
EP - 247
BT - KST 2024 - 16th International Conference on Knowledge and Smart Technology
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 16th International Conference on Knowledge and Smart Technology, KST 2024
Y2 - 28 February 2024 through 2 March 2024
ER -