The initial design of the CNN-Barley architecture requires adjustment to suit the specific application, image classes, problem complexity, and size of the training dataset. Factors such as the training method, activation function, feature selection, normalization techniques, filter size, number of filters, and layer count can significantly impact both the classification accuracy and efficiency of the designed network. Consequently, a series of experiments were conducted to compare accuracy and speed across various network configurations and different hyperparameter values that define the network’s structure.
On the assumption that recognizing subtle varietal characteristics is more challenging than detecting damaged kernels, the optimization of CNN-Barley was carried out primarily on an image collection containing kernels of diverse varieties. After appropriate adaptation of the final classification layers, the resulting network structure was also employed for the classification of defective kernels.
In the initial experiment, we compared three optimization algorithms: stochastic gradient descent (SGD), Adam, and Nadam. It is worth highlighting that both Adam and Nadam are enhancements of SGD designed to refine the learning process. All three algorithms were tested with varying optimization step lengths. In each case, training was halted when the classification accuracy for the validation set declined over two consecutive training epochs, and the final state of the network was the configuration that achieved the highest validation accuracy during training.
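This stopping rule and the state selection can be sketched as follows; a minimal PyTorch example, assuming hypothetical `train_loader`, `val_loader`, and `evaluate` helpers, with illustrative learning-rate values:

```python
import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader, evaluate,
                              optimizer, loss_fn, max_epochs=100, patience=2):
    """Stop when validation accuracy declines over `patience` consecutive
    epochs; return the weights that scored best on the validation set."""
    best_acc, best_state, epochs_without_gain = 0.0, None, 0
    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
        acc = evaluate(model, val_loader)  # accuracy on the validation set
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    model.load_state_dict(best_state)  # restore the best configuration
    return model, best_acc

# The three optimizers compared, each tested with several step lengths
# (learning-rate values here are illustrative only):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# optimizer = torch.optim.NAdam(model.parameters(), lr=1e-4)
```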
The outcomes of these comparisons are detailed in Table 1. Slightly higher classification accuracies were achieved with SGD; thus, the experiments did not confirm any advantage of Adam or Nadam over it. Typically, reducing the step size down to a certain value improves classification accuracy at the expense of longer training time, whereas decreasing it further yields no significant improvement. The presented results were therefore obtained with relatively small optimization step values, for which further reduction did not substantially increase the achieved accuracies.
In the designed CNN-Barley architecture (Fig. 5), it was assumed that the resolution of the input images and of the feature maps in successive network layers would be halved in both the vertical and horizontal directions. This practical approach allows a network of at most six layers; beyond that, the feature maps in the final layers would become too small to carry a sufficient number of features. The primary objective of the subsequent experiment was to determine the optimal number of CONV layers. Configurations ranging from 2 to 6 CONV layers were examined. Notably, the highest classification accuracy for barley varieties, exceeding 0.9, was achieved with four CONV layers; consequently, a network with this number of layers was adopted for subsequent experiments.
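The six-layer limit follows from a quick calculation (illustrative only; the exact sizes depend on padding and rounding in the pooling layers):

```python
# Successive halving of a 170-pixel input dimension by the pooling layers:
size = 170
for layer in range(1, 7):
    size //= 2
    print(f"after layer {layer}: {size} px")
# 85, 42, 21, 10, 5, 2 -- after six layers only 2 px remain,
# so a seventh halving is infeasible
```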
Subsequently, the network’s performance was assessed by comparing different sizes (3 × 3, 5 × 5, and 9 × 9) and quantities (4, 8, 16, 32, 64, 128, and 256) of filters in the initial CONV layer. The outcomes of these experiments are presented in Table 2, illustrating the results for specific combinations of filter counts and sizes. Notably, the highest classification accuracy was attained using relatively compact 3 × 3 filters, with a count of 128 filters in the first layer. Adding more filters began to curtail the network’s ability to generalize knowledge and led to diminished classification accuracy for both the validation and test datasets. It was also observed that employing 64 filters resulted in a 25% faster classification process and reduced memory usage, with only a minor decline in accuracy. Consequently, a configuration featuring 64 filters of size 3 × 3 was selected for further experimentation.
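The explored grid can be expressed compactly; a sketch in which each (size, count) pair instantiates a candidate first CONV layer (the padding that preserves the map size is an assumption of this sketch):

```python
import torch.nn as nn

# The (filter size, filter count) grid explored for the first CONV layer:
filter_sizes = [3, 5, 9]
filter_counts = [4, 8, 16, 32, 64, 128, 256]
candidate_first_layers = {
    (k, n): nn.Conv2d(3, n, kernel_size=k, padding=k // 2)
    for k in filter_sizes for n in filter_counts
}
```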
In the subsequent experiment, various activation functions were compared, encompassing the sigmoidal function, the rectified linear unit (ReLU), the exponential linear unit (ELU), and the Gaussian error linear unit (GELU). The classification accuracies attained for each activation function are presented in Table 3, with ReLU achieving the highest accuracy.
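As PyTorch modules, the four candidates compared were, in sketch form, each substituted in turn into the CONV blocks of the candidate network:

```python
import torch.nn as nn

activations = {"sigmoid": nn.Sigmoid(), "ReLU": nn.ReLU(),
               "ELU": nn.ELU(), "GELU": nn.GELU()}
```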
The finalized structure of CNN-Barley is depicted in Fig. 4. The architecture consists of four successive blocks, each incorporating a CONV layer, a ReLU activation layer, and a MaxPool layer that selects the features with the highest values. From block to block, the area of the feature maps is reduced by a factor of four (halved in each dimension), while their number (referred to as channels) increases; after the final MaxPool layer, 256 channels are retained. These 256 feature values are then fed into a multilayer perceptron (MLP) composed of fully connected (FC) layers: one hidden layer and one output layer. The output layer employs neurons with linear activations whose outputs are normalized by a Softmax function. During training, dropout with a rate of 30% is applied to the connections between these two layers; incorporating dropout increases the classification accuracy for barley varieties to 0.92.
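The resulting architecture can be summarized with the following PyTorch sketch; the intermediate channel counts, the hidden-layer width, and the global pooling that reduces the final 256 channels to a 256-value feature vector are assumptions, since the text fixes only the first-layer filters (64 of size 3 × 3), the final 256 channels, and the 30% dropout:

```python
import torch.nn as nn

class CNNBarley(nn.Module):
    """Sketch of the final CNN-Barley structure: four CONV+ReLU+MaxPool
    blocks followed by an FC classifier with one hidden layer and 30%
    dropout. Intermediate channel counts, hidden width, and the global
    pooling to a 256-value feature vector are assumptions."""

    def __init__(self, num_classes=6, channels=(64, 128, 192, 256)):
        super().__init__()
        blocks, in_ch = [], 3  # three RGB input channels
        for out_ch in channels:
            blocks += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]  # halves each spatial dimension
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)  # assumed reduction to 256 values
        self.classifier = nn.Sequential(
            nn.Linear(256, 128),          # hidden FC layer (width assumed)
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.3),            # 30% dropout between the two FC layers
            nn.Linear(128, num_classes))  # linear outputs; Softmax applied
                                          # to normalize them at inference

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.classifier(x)

# model = CNNBarley(num_classes=6)  # six outputs for variety recognition
```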
The dimensions of the input layer of the designed network vary with the size of the input images: for C- and B-type images they are set at 170 × 170 pixels, whereas for R-type images they are 170 × 80. In both cases, color images with three RGB channels are analyzed. Conversely, the number of neurons in the output layer must correspond to the number of classes; activation of a specific neuron within this layer signifies that the image has been recognized as belonging to the respective class. For variety recognition the network has six outputs, whereas for defective-kernel detection a network with five outputs is employed.
In the subsequent experiment, a comparison was conducted between the developed CNN-Barley and three other state-of-the-art methods. The first one involved a traditional approach encompassing feature extraction, feature space reduction, and classification. For this purpose, the QMaZda software package was employed45,46,47. The remaining two methods employed CNNs with weight transfer from the AlexNet and ResNet networks, respectively. Therefore, throughout the rest of this paper, the reference methods will be denoted as QMaZda, AlexNet, and ResNet.
In the traditional approach (QMaZda), image feature extraction was carried out, encompassing the calculation of morphological, color, and texture features. Subsequently, a feature selection process (using a wrapper approach) was employed to identify feature groups with the highest discriminative capabilities. The discriminative power of these features was evaluated based on Fisher’s discriminant, derived from linear discriminant analysis.
The selection procedure was executed separately for the images representing different varieties and for the images of defective kernels. Additionally, as feature selection constitutes part of the machine learning process, the selection procedure was repeated for each of the 20 cross-validation instances. As a result, during each instance, feature groups consisting of 30 to 50 of the most discriminative features were chosen. Vectors of the selected feature values, extracted from images of the training set, were used for training SVM classifiers.
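A simplified sketch of this selection-and-classification stage, assuming a feature matrix `X` (kernels × features) and labels `y`; QMaZda's actual wrapper search is more elaborate than the greedy Fisher-ratio ranking shown here:

```python
import numpy as np
from sklearn.svm import SVC

def fisher_ratio(x, y):
    """Between-class to within-class variance ratio for a single feature."""
    classes = np.unique(y)
    overall = x.mean()
    between = sum((y == c).sum() * (x[y == c].mean() - overall) ** 2
                  for c in classes)
    within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    return between / within

def select_features(X, y, k=40):
    """Keep the k features with the highest Fisher ratio (30-50 in the paper)."""
    scores = np.array([fisher_ratio(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

# Repeated for each cross-validation instance:
# selected = select_features(X_train, y_train)
# clf = SVC().fit(X_train[:, selected], y_train)
# accuracy = clf.score(X_test[:, selected], y_test)
```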
For the CNN-based reference methods, pretrained AlexNet and ResNet models were employed; in the case of ResNet, the 18-layer variant (ResNet-18) was used. This choice was regarded as a reasonable trade-off between model complexity and the available hardware resources. Network models with input layers of 224 × 224 pixels were utilized; to align the existing images with this input size, black pixel margins were added around the original images. Furthermore, the number of neurons in the output layers was adjusted to match the recognized classes, either 5 or 6.
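A minimal sketch of this adaptation, assuming torchvision's pretrained models (the weight-selection API shown is that of torchvision ≥ 0.13):

```python
import torch.nn as nn
import torchvision.transforms.functional as TF
from torchvision import models

def pad_to_224(img):
    """Add black (zero-valued) margins around a CxHxW image tensor to 224x224."""
    _, h, w = img.shape
    left, top = (224 - w) // 2, (224 - h) // 2
    return TF.pad(img, [left, top, 224 - w - left, 224 - h - top])

resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
resnet.fc = nn.Linear(resnet.fc.in_features, 6)  # 6 varieties (5 for defects)

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.classifier[6] = nn.Linear(alexnet.classifier[6].in_features, 6)
```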
Two approaches were employed for training. In the first, training was confined to the FC layers responsible for classification, while the weights of the CONV layers remained unaltered. In the second, training extended to the CONV layers, encompassing fine-tuning of the filters; this latter approach increased classification accuracy by 2 to 5 percentage points. Due to the large number of weights in the AlexNet and ResNet architectures, it was necessary to enlarge the training datasets. Augmentation was applied, involving random rotations within a 5° limit, shifts, and mirror-image flipping of the original images. This augmentation improved the classification outcomes for both the reference methods and the developed CNN-Barley.
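Both training regimes and the augmentation can be sketched as follows (the `resnet` object from the previous sketch is reused; the shift fraction is an assumption, since the text does not quantify the shifts):

```python
import torchvision.transforms as T

# Approach 1: train only the FC classifier; CONV weights stay frozen.
for p in resnet.parameters():
    p.requires_grad = False
for p in resnet.fc.parameters():
    p.requires_grad = True

# Approach 2: fine-tune the whole network (all weights trainable), which
# raised accuracy by 2 to 5 percentage points in the experiments.
for p in resnet.parameters():
    p.requires_grad = True

# Augmentation applied to the training images:
augment = T.Compose([
    T.RandomRotation(degrees=5),                        # rotations within 5°
    T.RandomAffine(degrees=0, translate=(0.05, 0.05)),  # small random shifts
    T.RandomHorizontalFlip(),                           # mirror-image flipping
])
```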
Table 4 presents the classification accuracy values obtained for the developed neural network and the three reference solutions. Since fine-tuning of the filters in AlexNet and ResNet emerged as the superior approach, the outcomes achieved through fine-tuning are presented. All accuracy values are averages over the 20 cross-validation repetitions.
Table 5 displays the confusion matrices for the classification of healthy and defective kernels using CNN-Barley and QMaZda. The set of B-type images (Fig. 3) was used in this experiment. The rows of the matrices correspond to the actual classes, while the columns represent the predictions. The values are percentages of predictions relative to the number of kernels in the actual classes, and values in parentheses indicate standard deviations calculated from 20 repetitions of the learning process. These matrices enable the evaluation of the correct prediction ratio for individual classes and quantify the classification errors between actual and predicted classes. In most instances, the confusion matrices reaffirm the superiority of the proposed CNN-Barley method; the only exception is the recognition of broken kernels, where QMaZda slightly outperforms the CNN.
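The reported matrices can be assembled as follows; a sketch assuming a hypothetical `runs` list of (actual, predicted) label pairs from the 20 repetitions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def percent_confusion(y_true, y_pred):
    """Row-normalized confusion matrix: rows are actual classes, values are
    percentages of predictions relative to the class sizes."""
    cm = confusion_matrix(y_true, y_pred).astype(float)
    return 100.0 * cm / cm.sum(axis=1, keepdims=True)

# cms = np.stack([percent_confusion(t, p) for t, p in runs])
# mean_cm, std_cm = cms.mean(axis=0), cms.std(axis=0)  # values and (std) entries
```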
It is important to highlight that in the classification of broken kernels using the QMaZda method, the most discriminative features were the kernel’s area and elongation ratio. These features fall into the category of morphological (shape descriptive) attributes. In contrast, such shape characteristics cannot be computed by the CONV layers of the CNNs.
All experiments were conducted using a computer equipped with an Intel® Core i7-4930K 3.40 GHz processor (CPU) and an NVIDIA GeForce GTX 780 Ti graphics card (GPU), operating under Linux-Kubuntu 20.04. Image preprocessing, feature extraction, and SVM classification utilized the CPU, while the majority of computations associated with deep neural networks were performed on the GPU. Given the diverse computational resources used by the compared image classification methods, it is difficult to gauge their efficiency purely from time measurements. Nonetheless, information on the analysis time can be beneficial when deciding which method to deploy on modern personal computers, which typically offer both CPU and GPU technologies.
To compare the performance of the proposed solutions, we suggest focusing on model complexity (Table 6). The number of parameters is crucial, as it influences the model’s complexity, its ability to generalize, and the computational demands during training. Reducing the complexity of the classification model is one way to mitigate the risk of overfitting to the training data.
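Parameter counts such as those in Table 6 can be read directly from the model objects; a minimal sketch for any PyTorch module:

```python
def count_parameters(model):
    """Number of trainable parameters of a torch.nn.Module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# print(count_parameters(CNNBarley()))  # vs. count_parameters(resnet), etc.
```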
For the conventional image classification approach (QMaZda), the analysis time for a single grain remains within 380 ms per computational thread. This encompasses the time needed for preprocessing a pair of images, taking around 100 ms, the computation of a selected feature group ranging from 90 to 250 ms (dependent on the number and type of features selected), and the classification process itself, requiring approximately 30 ms. Leveraging the multiple cores of the processor, computations can be run concurrently for several kernel images, resulting in average analysis times of 25–50 ms per image.
In CNN-Barley, the average analysis time per kernel ranges from approximately 0.4 to 0.7 ms, depending on the input image size, which varies based on the method used to acquire the research material. In comparison, the classification process in AlexNet and ResNet requires 2.5 ms and 4.8 ms, respectively. It is important to note that these times do not account for the image preprocessing performed by the CPU or the time needed to transfer images to GPU memory.
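The caveat about excluded preprocessing and transfer time suggests measuring GPU inference in isolation; a sketch of such a measurement, in which the batch is moved to GPU memory before timing and the model is assumed to reside there already:

```python
import time
import torch

@torch.no_grad()
def gpu_inference_time_ms(model, batch, repeats=100):
    """Average per-image GPU inference time in milliseconds; host-to-device
    transfer happens before timing, mirroring the reported measurements."""
    model.eval()
    batch = batch.cuda()          # transfer excluded from the measurement
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        model(batch)
    torch.cuda.synchronize()      # wait for all GPU work to finish
    return 1000 * (time.perf_counter() - start) / (repeats * batch.size(0))
```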
Currently, image preprocessing is carried out on a CPU, which is well-suited for handling diverse computational tasks. However, there are methods that allow for the implementation of preprocessing on a GPU48,49. By leveraging GPU capabilities, it is possible to reduce preprocessing time, thereby enhancing the overall efficiency of the image processing pipeline. Transitioning from CPU to GPU for preprocessing can result in significant performance improvements, particularly in applications that require real-time processing.