Ensemble of multimodal deep learning autoencoder for infant cry and pain detection

Yosi Kristian, Natanael Simogiarto, Mahendra Tri Arif Sampurna, Elizeus Hanindito, Visuddho Visuddho

Research output: Contribution to journal › Article › peer-review


Background: Infants cannot communicate their pain verbally. Several pain scores have been developed, but they are subjective and show high inter-observer variability. The aim of this study was to construct models that use both facial expression and infant voice for pain-level classification and cry detection.

Methods: The study included a total of 23 infants below 12 months of age who were treated at Dr Soetomo General Hospital. Face, Legs, Activity, Cry and Consolability (FLACC) pain scale assessments and recordings of the infants' cries were taken in video format. A machine-learning-based system was created to detect infant cries and pain levels. The Short-Time Fourier Transform was used to convert the audio data into spectrograms, a time-frequency representation. Facial features combined with voice features, both extracted using deep learning autoencoders, were used for the classification of infant pain levels. Two types of autoencoders, a Convolutional Autoencoder and a Variational Autoencoder, were used for both faces and voices.

Results: The goal of each autoencoder was to produce a latent vector with much smaller dimensions that could still recreate the data with minor losses. A Convolutional Neural Network (CNN) trained on a multimodal representation built from the latent vectors achieved a relatively high F1 score, higher than that obtained from a single modality such as voice or facial expression alone. The experiment had two major parts: (1) building the three autoencoder models, one each for the infant's face, the amplitude spectrogram, and the dB-scaled spectrogram of the infant's voice; (2) using the latent vectors from the autoencoders to build the cry detection and pain classification models.

Conclusion: In this paper, four pain classifier models with relatively good F1 scores were developed. These models were combined using ensemble methods, which resulted in a better F1 score.
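The audio preprocessing named in the abstract (a Short-Time Fourier Transform yielding an amplitude spectrogram and a dB-scaled spectrogram) can be illustrated with a minimal sketch. This is not the authors' code: the function name `stft_spectrograms`, the Hann window, the frame length of 512, the hop of 128, the 16 kHz sample rate, and the synthetic 440 Hz tone standing in for a cry recording are all assumptions chosen for illustration.

```python
import numpy as np

def stft_spectrograms(signal, frame_len=512, hop=128):
    """Return (amplitude, dB-scaled) spectrograms shaped (freq_bins, frames).

    Hypothetical helper: window type and sizes are illustrative assumptions.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Magnitude of the one-sided FFT of each frame
    amplitude = np.abs(np.fft.rfft(frames, axis=1)).T
    # dB relative to the peak; the floor avoids log(0) and division by zero
    ref = max(amplitude.max(), 1e-10)
    db = 20.0 * np.log10(np.maximum(amplitude, 1e-10) / ref)
    return amplitude, db

# Synthetic 1-second tone (assumed stand-in for a real recording)
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
amp, db = stft_spectrograms(tone)
```

Under these assumptions each spectrogram is a 2-D array that could be fed to an image-style autoencoder; the dB scaling compresses the dynamic range so that quieter frequency components remain visible next to the peak.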

Original language: English
Article number: 359
Publication status: Published - 2023


  • audio frequency features
  • autoencoder
  • deep learning
  • infant cry detection
  • infant facial pain classification

