Papers by Mohamad Ivan Fanany
This study examines the application of quantization to Bi-directional Long Short-Term Memory (Bi-LSTM); the combination is called qBi-LSTM. The quantization comes from Deep Belief Networks (DBN). DBN was selected for its strength as a generative deep learning model in producing optimal artificial features. The qBi-LSTM is expected to improve the performance of Bi-LSTM while also being time-efficient. The qBi-LSTM is tested on sleep stage classification using the St. Vincent's University Hospital / University College Dublin Sleep Apnea Database. The results show that qBi-LSTM achieves the highest performance compared with Bi-LSTM and DBN, with precision, recall, and F-measure values of 86.00%, 72.10%, and 75.27%, respectively. The qBi-LSTM performs best at classifying Stage 2 but still fails to classify the REM (Rapid Eye Movement) stage.
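The three metrics reported above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the stage labels, and the choice of macro averaging over classes are all assumptions, since the abstract does not say how the per-class values were aggregated.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f_measure)} for each class."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall, and F-measure over classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical labels: a class that is never predicted (here "REM")
# contributes zeros and pulls the macro averages down.
y_true = ["S2", "S2", "REM", "W"]
y_pred = ["S2", "S2", "S2", "W"]
macro = macro_average(per_class_prf(y_true, y_pred))
```

Under macro averaging, the averaged F-measure is generally not the harmonic mean of the averaged precision and recall, because the harmonic mean is taken per class before averaging.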
This research proposes a method to detect diabetic retinopathy automatically from fundus photographs. An automatic method will speed up diabetic retinopathy detection, especially in Indonesia, which lacks ophthalmologists. In addition, differences in doctors' ability and experience may produce inconsistent results. With this method, we hope that detection of diabetic retinopathy will be faster and more consistent, so that blindness caused by diabetic retinopathy can be prevented as early as possible. A Convolutional Neural Network (CNN) is a neural network variant that detects patterns in images very well. A Residual CNN is a CNN variant that can prevent accuracy degradation in deep neural networks. This inspired us to apply a Residual CNN to diabetic retinopathy detection. The Residual Network detects diabetic retinopathy with a kappa score of 0.51049.
Recommender system (RS) performance depends largely on diverse types of input that characterize users' preferences in the form of both explicit and implicit feedback. Explicit feedback is stated directly by users regarding their interest in some service or product options. Such feedback, however, is not always available. Implicit feedback, which reflects users' opinions indirectly through their behavior, is far more abundant. In this paper, we elaborate several ways to improve the RS on three real-world datasets (online travel service, online transportation, and telecommunication service provider) through implicit feedback. In the first case, we analyze the effect of simple feedback from users' input during registration, without using any social network analysis (SNA). In the second case, we analyze the effect of the community structure extracted from the SNA as additional attributes. In the third case, we analyze the effect of further feedback attributes (modularity, PageRank, eigenvector centrality, clustering coefficient, weighted in-degree, weighted out-degree, and weighted degree), which are also obtained from the SNA of the corresponding dataset. Given the right hyperparameter settings, we observed RS improvement in terms of RMSE (root mean square error) in all three cases. Three RS models are used: SVD, SVD++, and difference SVD. Besides discussing RS performance, we also discuss the computational cost incurred by incorporating the implicit feedback.
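The RMSE criterion used above to compare the recommender models can be written out as a short sketch. The function name and the sample ratings are illustrative assumptions and do not come from the paper's datasets.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))


# Hypothetical ratings: a lower RMSE means predictions closer to the truth.
baseline_error = rmse([4, 3, 5], [3, 3, 3])
improved_error = rmse([4, 3, 5], [4, 3, 4])
```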
The explosion of data on the internet has led to the big data era, which requires data processing to extract useful information. One of the challenges is gesture recognition in video processing. This study therefore proposes Latent-Dynamic Conditional Neural Fields and compares them with other members of the Conditional Random Fields family. To improve accuracy, these methods are combined with fuzzy clustering. The results show that Fuzzy Latent-Dynamic Conditional Neural Fields achieve the highest performance. In addition, combining the basic classifiers with Fuzzy C-Means clustering yields higher performance than the original classifiers. The evaluation uses a temporal dataset for gesture phase segmentation.
Deep learning is an area of machine learning that has been proven to solve many real-world problems, such as object recognition and detection. One popular deep learning method is the Faster Region-Based Convolutional Neural Network (Faster R-CNN). Faster R-CNN proposed an integrated structure of a CNN and a region proposal network to detect multiple objects in a single image. Even though deep learning is powerful for object recognition and detection, running both learning and inference on mobile devices remains a problem because of the large memory and computation required. In this paper, we first propose reducing the number of filters and nodes in the convolutional and fully connected layers to 50% to make implementation feasible in a mobile environment, and we compare the result with the original model. Second, we use Structured Sparsity Learning (SSL) in the convolutional layers to regularize the Deep Neural Network (DNN) structure with group lasso. Third, we use the Ristretto framework to convert floating point to 8-bit and 16-bit fixed point to represent the weights and outputs of the fully connected layer. Our results show that reducing the number of filters and nodes lowers memory storage by up to 4.16x, and the reduced model was successfully trained on the NVIDIA Jetson Tegra TX1 Development Kit, used as a mobile environment emulator. Ristretto successfully condenses a model to 16 or 8 bits with an error tolerance of ~1% while improving accuracy from 0.85 to 0.87 at k = 5 for the original model, and from 0.84 to 0.85 at k = 10 for the 50% model, on the CCTV UI dataset. SSL works well on the 50% model, improving accuracy from 0.83 to 0.84 at k = 5 and from 0.84 to 0.86 at k = 10, and runs 2.72x faster than the original convolutional layer without SSL.
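The float-to-fixed-point conversion mentioned above can be illustrated with a generic sketch. This does not use or reproduce Ristretto's API: the function name, the signed format, and the fixed split between integer and fractional bits (`frac_bits`) are assumptions chosen for illustration.

```python
def quantize_fixed_point(values, total_bits=8, frac_bits=5):
    """Round floats to a signed fixed-point grid and return them as floats.

    total_bits: word length including the sign bit (assumed format).
    frac_bits:  number of bits after the binary point (assumed format).
    """
    scale = 2 ** frac_bits
    q_min = -(2 ** (total_bits - 1))      # most negative integer code
    q_max = 2 ** (total_bits - 1) - 1     # most positive integer code
    quantized = []
    for v in values:
        code = round(v * scale)               # nearest integer code
        code = max(q_min, min(q_max, code))   # saturate on overflow
        quantized.append(code / scale)        # back to a real value
    return quantized


# Hypothetical weights: in-range values move by at most 1 / (2 * scale),
# while out-of-range values saturate at the format's limits.
weights = [0.3, -1.27, 10.0, -10.0]
weights_q8 = quantize_fixed_point(weights, total_bits=8, frac_bits=5)
```

With 8 bits and 5 fractional bits the representable range is roughly -4 to 3.97 in steps of 1/32, so the trade-off between range and resolution is set by `frac_bits`.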
Neural networks can perform a variety of tasks well after learning, but they are still limited in what they can remember because their memory is very limited. The Differentiable Neural Computer (DNC) has been shown to address this problem. A DNC consists of a neural network coupled to an external memory module that works like a tape on an accessible Turing machine. A DNC can solve simple problems that require memory, such as copy, graph, and question answering tasks. A DNC learns the algorithm for accomplishing a task from inputs and outputs. In this research, a DNC with a Multi-Layer Perceptron (MLP) as the controller is compared with an MLP alone. The aim of this investigation is to test the ability of a neural network to learn explicit and implicit knowledge at once. The tasks are sequence classification and sequence addition of MNIST handwritten digits. The results show that an MLP with external memory processes sequence data much better than one without. The results also show that the DNC, as a fully differentiable system, can solve problems that require learning explicit and implicit knowledge at once.
It is widely known that visual cues play an important role in speech, especially in disambiguating confusable phonemes or as a means of "hearing" visually. Interpreting speech through the visual signal alone is called lip reading. Lip reading has several potential applications, either as a complementary modality to speech recognition or as purely visual speech recognition, which gives rise to silent speech interfaces that themselves have numerous practical applications. Despite the overwhelming potential of such systems, research on lip reading for the Indonesian language has been extremely limited, with settings still very distant from the real world. This research is an attempt to build a lip reading model that has the potential to be applicable in the real world, specifically one that supports variable-length sentences as input. We build the model using deep learning, specifically a spatiotemporal Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU), which form the spatiotemporal feature extractor and the character-level sentence decoder, respectively. During the process, we also investigate whether knowledge of lip reading in another language affects the acquisition of a different language. To the best of our knowledge, our model is the first sentence-level Indonesian lip reading model that supports variable-length input. Our model achieved superhuman performance on all metrics, with almost 2× better word accuracy.
Word boundary detection is one of the primary components of a speech recognition system. It can be learned jointly as part of the speech model or independently as an extra preprocessing step, reducing the problem to conditionally independent word prediction. It can also be used to separate Out of Vocabulary (OOV) words in a sentence, thereby avoiding unnecessary computation. By itself, word boundary detection is essential in multimodal corpus collection, where it allows automated and detailed labeling of the dataset at the sentence or word level. In this research, we propose a novel approach to word boundary detection that uses only visual information, based on a 3-Dimensional Convolutional Neural Network (3D-CNN) and a Bidirectional Gated Recurrent Unit (Bi-GRU). This research is important in paving the way for better lip reading systems as well as multimodal speech recognition, as it allows easier creation of novel datasets and enables conventional word-level visual or multimodal speech recognition systems to work on continuous speech. Training was done on the GRID video corpus for 118 epochs. The proposed model performed well compared with the baseline method, with a considerably lower error rate.
Advancement of Automatic Speech Recognition (ASR) relies heavily on the availability of data, even more so for deep learning ASR systems, which are at the forefront of ASR research. Many corpora have been built to meet this need, ranging from single-modality corpora, which mostly serve acoustic speech recognition with a few exceptions for visual speech decoding, to multimodal corpora, which serve both modalities. Multimodal corpora have been significant in the development of ASR, as speech is inherently multimodal. Despite this importance, none of these corpora was built for the Indonesian language, resulting in little to no development of visual-only or multimodal ASR systems. This research is an attempt to solve that problem by constructing AVID, an Indonesian audiovisual speech corpus for multimodal ASR. The corpus consists of 10 speakers speaking 1,040 sentences with a simple structure, resulting in 10,400 videos of spoken sentences. To the best of our knowledge, AVID is the first audiovisual speech corpus for the Indonesian language designed for multimodal ASR. AVID was heavily tested and shows low overall error in tests of both modalities, which indicates the high quality and suitability of the corpus for building multimodal ASR systems.
Cheating during exams is a problem in the field of education. It undermines efforts to evaluate students' proficiency and growth. We propose a real-time cheating detection system that uses a video feed to monitor students during written exams for illegal behaviors and gestures, such as giving codes, looking at friends, using a cheat sheet, talking, and exchanging papers between students. Gestures are recognized at runtime from the sequences of actions performed by the subjects and are then used to generate textual descriptions of the detected cheating gestures. These textual descriptions help document the activities that transpired during the exams for later use. Our proposed system comprises two primary subsystems: a gesture recognition model based on 3DCNN and XGBoost, and a language generation model based on an LSTM network. The gesture recognition model recognizes the cheating gestures with 81.11% accuracy and a Kappa statistic of 0.760. The language generation model achieves 95.3% word accuracy and an average edit distance of 1.076 on single-subject description sentences, and 96.6% word accuracy and an average edit distance of 3.305 on interaction description sentences. The system runs at 32.54 fps on a mid-range laptop.
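The edit distance used above to score generated descriptions can be sketched with the standard Levenshtein recurrence. The abstract does not say whether the distance was computed over characters or words, so this sketch accepts any sequence; the function name and the example sentences are assumptions.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j items of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical description sentences compared at the word level.
reference = "the student looks at a friend".split()
generated = "the student looks at friend".split()
word_level_distance = edit_distance(reference, generated)
```

Averaging this value over all test sentences gives an average edit distance; longer sentences naturally allow larger distances at the same word accuracy.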
In this paper, we propose a workflow and a machine learning model for recognizing handwritten characters on form documents. The learning model is based on a Convolutional Neural Network (CNN) as a powerful feature extractor and Support Vector Machines (SVM) as a high-end classifier. The proposed method is more efficient than modifying the CNN with a complex architecture. We evaluated several SVMs and found that the linear SVM with an L1 loss function and L2 regularization gives the best performance in both accuracy and computation time. In experiments using data from NIST SD 19 2nd edition for both training and testing, the proposed method, which combines a CNN and a linear SVM with an L1 loss function and L2 regularization, achieved a better recognition rate than the CNN alone. The recognition rates achieved by the proposed method are 98.85% on numeral characters, 93.05% on uppercase characters, 86.21% on lowercase characters, and 91.37% on the merger of numeral and uppercase characters. The original CNN achieves accuracy rates of 98.30% on numeral characters, 92.33% on uppercase characters, 83.54% on lowercase characters, and 88.32% on the merger of numeral and uppercase characters. The proposed method was also validated using ten-fold cross-validation, which shows that it still improves the accuracy rate. The learning model was used to construct a handwriting recognition system that automatically recognizes more challenging data on form documents. Pre-processing, segmentation, and character recognition are integrated into one system. The output of the system is converted into editable text. The system gives an accuracy rate of 83.37% on ten different test form documents.
Human gender detection from body profile is an important task for surveillance. Most surveillance cameras are placed at a distance such that it is not possible to see people's faces clearly. In this paper, we report a comparison between fast feature pyramids and a deep region-based convolutional neural network (RCNN) for detecting a person in surveillance images. Since the RCNN performs better at detecting a person, it is trained further to detect men and women. A transfer learning strategy is used because of the small number of training images. The results show that the trained RCNN can detect men and women with promising results.
In Indonesia, based on the results of Basic Health Research 2013, the number of stroke patients increased from 8.3‰ (2007) to 12.1‰ (2013). Some researchers now use electroencephalography (EEG) results as another option for detecting stroke, besides CT scan images as the gold standard. A previous study on data from stroke and healthy patients at the National Brain Center Hospital (RS PON) used the Brain Symmetry Index (BSI), Delta-Alpha Ratio (DAR), and Delta-Theta-Alpha-Beta Ratio (DTABR) as features for classification by an Extreme Learning Machine (ELM). That study obtained 85% accuracy, with sensitivity above 86%, for acute ischemic stroke detection. Using EEG data means dealing with many data dimensions, which can reduce classifier accuracy (the curse of dimensionality). Principal Component Analysis (PCA) can reduce dimensionality and computation cost without decreasing classification accuracy. XGBoost, a scalable tree boosting classifier, can solve real-world scale problems (the Higgs Boson and Allstate datasets) using a minimal amount of resources. This paper reuses the same data from RS PON and the features from the previous research, preprocessed with PCA and classified with XGBoost, to increase accuracy with fewer electrodes. Using a specific, smaller set of electrodes improved the accuracy of stroke detection. Our future work will examine algorithms other than PCA to obtain higher accuracy with fewer channels.
The need for segmentation and labeling of sequence data appears in several fields. Conditional models such as Conditional Random Fields are widely used to solve this problem. In pattern recognition, Conditional Random Fields specify the probabilities of a label sequence. The method builds the full label sequence as a probabilistic graphical model conditioned on the observations. However, Conditional Random Fields cannot capture internal structure, so Latent-Dynamic Conditional Random Fields were developed to do so without giving up the external dynamics between labels. This study proposes the use of Latent-Dynamic Conditional Random Fields for gesture recognition and compares the two methods. It also proposes the use of scalar features for gesture recognition. The results show that the performance of Latent-Dynamic Conditional Random Fields is not better than that of Conditional Random Fields, and that scalar features are effective for both methods in gesture recognition. The study therefore recommends Conditional Random Fields with scalar features for better gesture recognition performance.
SIBI (Sistem Isyarat Bahasa Indonesia) is the commonly used sign language in Indonesia. SIBI, which follows the grammatical structure of the Indonesian language, is a complex and unique sign language. A method that recognizes SIBI gestures rapidly, precisely, and efficiently needs to be developed for the SIBI machine translation system. The ultimate goal is a feature extraction method with a space-efficient feature set that retains the capability to recognize the different types of SIBI gestures. There are four types of SIBI gestures: root, affix, inflectional, and function word gestures. This paper proposes using a heuristic Hidden Markov Model and a feature extraction system to separate an inflectional gesture into its constituents: prefix, suffix, and root. The separation reduces the number of feature sets, which would otherwise be as large as the product of the prefix, suffix, and root word feature sets of the inflectional word gestures.
Aside from proper use of grammar, diction, and punctuation, a good essay must have cohesion and coherence. In a persuasive essay, argumentative discourse is an important parameter for assessing cohesion and coherence among the arguments. An argument is characterized by one's stance (claim), which is strengthened with facts (premises) to establish the validity of the stance. Ideally, claims are followed by premises that either support or attack them. In this paper, we try to identify four kinds of argument components (major claim, claim, premise, and non-argumentative) using some predefined features, and we measure the performance of word vector representations in identifying argument components. We also present the results of our initial experiment using deep learning to classify the argument components.
This research aims to classify cheating activity during exams from video observation. The method uses a Conditional Random Field (CRF) to classify and detect several classes of cheating activities. The locations of the body joints are detected with a Multimodal Decomposable Model (MODEC) with superpixel segmentation. The joints used are the head, shoulders, elbows, and wrists. The superpixel method is Simple Linear Iterative Clustering (SLIC). A comparison between MODEC and MODEC + SLIC as the feature detector for the CRF showed that MODEC + SLIC provides better activity classification. In our experiments, cheating activities were detected with up to 83.9% accuracy on average. Moving beyond detecting only the class of motion segments, we also devised a point-in-time event detection system, also using a CRF. The times of occurrence of three consecutive cheating activities are determined from a sequence of video frames.
Uploads
Links to the papers are given in the journal names. I also computed the average of each column.
I randomly selected 5 of the most popular and recent papers from each journal to compute Review Time, Revision Time, and Publication Time in days.
The Detailed Review Time Analysis can be found at the following link.
Open access fees for IEEE journals vary. For hybrid journals: $1750. For fully open access journals: starting from $1350.
I computed Total Impacts per Total Review Time in months. I hope this is helpful.
Links to the papers are given in the journal names. I also computed the average of each column.
I randomly selected 5 of the most popular and recent “Regular” papers from each journal to compute Review Time, Revision Time, and Publication Time.
Sometimes the Revised date or the Publication date was not given in the paper, only the Accepted date (so some adaptation was necessary).
Links to the papers are given in the journal names. I also computed the average of each column. Open access fees for Hindawi journals vary, and some journals are even free.
I computed Average Impacts per Total Review Speed in months. I hope this is helpful. Unfortunately, unlike Elsevier and Springer journals, not all Hindawi journals are indexed by Scopus (which is a requirement of our university and the government in Indonesia).
Motivations:
Deep neural networks are highly expressive models that have recently achieved state-of-the-art performance on speech and visual recognition tasks.
Their expressiveness is the reason they succeed, but it also causes them to learn uninterpretable solutions that can have counter-intuitive properties.
the most interesting items from a larger set.
Basic Idea:
If users shared the same interests in the past – if they viewed or bought the same books, for instance – they will also have similar tastes in the future.
Selecting hopefully interesting books involves filtering the most promising ones from a large set, and because users implicitly collaborate with one another, the approach is called collaborative filtering (CF).
Pure CFs do not exploit or require any knowledge about the books themselves.
Advantage: data about the books do not have to be entered into the system.
Shortcoming: using knowledge about the books themselves to propose books that the user likes might be more effective.
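The basic idea above can be sketched as a small user-based collaborative filter. This is a minimal illustration under stated assumptions: the function names, the cosine similarity measure, the similarity-weighted scoring, and the sample ratings are all hypothetical choices, and nothing about the books themselves is used.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two {item: rating} dictionaries."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[item] * b[item] for item in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not seen by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        if similarity <= 0:
            continue  # users with no shared history contribute nothing
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


# Hypothetical purchase history: 1 means the user bought the book.
ratings = {
    "ann": {"book_a": 1, "book_b": 1},
    "bob": {"book_a": 1, "book_b": 1, "book_c": 1},
    "cid": {"book_d": 1},
}
suggestions = recommend(ratings, "ann")
```

Because "ann" and "bob" bought the same books in the past, "bob"'s remaining book is suggested to "ann", while "cid", who shares no history with "ann", has no influence.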
and process of Conditional Random Fields (CRF). This worksheet is based on an
excellent tutorial on CRF by [Edwin Chen](http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/).
I hope this worksheet will clarify the tutorial. I added two other feature functions
to make the example look more realistic. I hope to extend this worksheet later
to the dynamic programming part, HCRF, and deep structured CRF.
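As a rough illustration of how feature functions drive a linear-chain CRF, the sketch below scores a labeling as a weighted sum of feature functions and normalizes by brute force over all labelings. The three feature functions, their weights, the label set, and the example sentence are hypothetical and are not the ones used in the worksheet or the tutorial; brute-force enumeration stands in for the dynamic programming part and is only practical for very short sentences.

```python
from itertools import product
from math import exp

LABELS = ("NOUN", "VERB", "ADVERB")  # assumed toy label set


def f_ly_adverb(sentence, i, label, prev_label):
    """Fires when a word ending in 'ly' is labeled ADVERB."""
    return 1.0 if label == "ADVERB" and sentence[i].endswith("ly") else 0.0


def f_verb_after_noun(sentence, i, label, prev_label):
    """Fires when a VERB directly follows a NOUN."""
    return 1.0 if prev_label == "NOUN" and label == "VERB" else 0.0


def f_first_word_noun(sentence, i, label, prev_label):
    """Fires when the first word is labeled NOUN."""
    return 1.0 if i == 0 and label == "NOUN" else 0.0


# (feature function, weight) pairs; the weights are made up, not learned.
FEATURES = [(f_ly_adverb, 2.0), (f_verb_after_noun, 1.5), (f_first_word_noun, 1.0)]


def score(sentence, labels):
    """Sum of weighted feature functions over all positions."""
    total = 0.0
    for i in range(len(sentence)):
        prev_label = labels[i - 1] if i > 0 else None
        for feature, weight in FEATURES:
            total += weight * feature(sentence, i, labels[i], prev_label)
    return total


def probability(sentence, labels):
    """exp(score) normalized over every possible labeling (brute force)."""
    candidates = product(LABELS, repeat=len(sentence))
    z = sum(exp(score(sentence, candidate)) for candidate in candidates)
    return exp(score(sentence, labels)) / z


def best_labeling(sentence):
    """Highest-scoring labeling, found by exhaustive search."""
    candidates = product(LABELS, repeat=len(sentence))
    return max(candidates, key=lambda candidate: score(sentence, candidate))


sentence = ["dogs", "run", "quickly"]
best = best_labeling(sentence)
```

The number of candidate labelings grows exponentially with sentence length, which is why practical CRFs replace the enumeration with dynamic programming.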