1 Introduction

Machine learning (ML) is increasingly being employed to tackle, and delve deeper into, the harder problems in quantum information science. In recent years, it has been applied to state classification [1,2,3], state reconstruction [4], parameter estimation [5], and many other tasks [6,7,8,9,10,11,12,13]. The motivation behind using ML in quantum information is to gain insight into problems where the usual numerical techniques either fail or demand excessive resources, e.g., optimization tasks in highly constrained or non-convex scenarios.

Deciding whether an arbitrary quantum state is entangled or not is an NP-hard problem [14]. It is one of the long-standing fundamental issues in entanglement theory. A state of a composite system \(\rho _{AB}\) is said to be separable if \(\rho _{AB} = \sum _i p_i\rho _{A}^i \otimes \rho _{B}^i\) for the two subsystems A and B, where \(p_i\) \((\ge 0)\) represents a classical mixing probability with \(\sum _ip_i=1\); otherwise, it is an entangled state. There exist numerous criteria to detect bipartite entanglement; however, these criteria are less reliable for higher-dimensional systems. For example, the popular Peres-Horodecki criterion states that separable states have positive partial transpose (PPT) [15, 16], meaning that for separable states \(\rho _{AB}^{T_A}\ge 0\), where \(T_A\) denotes transposition on system A. The criterion is necessary and sufficient for \(d_Ad_B\le 6\), where \(d_A\) and \(d_B\) denote the subsystem dimensions. Other extant methods include entanglement witnesses, the reduction criterion, and the cross-norm or realignment criterion, to name a few [17]. The most powerful technique is the k-extension hierarchy, but it is notoriously hard to compute because its complexity grows exponentially with k [18, 19]. Recently, Ref. [1] showed that ML techniques are instrumental in probing the separability-entanglement classification, and established that the ML-based technique is more efficient in terms of speed and accuracy than the extant methods. Further ML-based techniques for quantum separable-entangled classification using artificial neural networks have also been studied [20, 21].

Ref. [1] employed the convex hull approximation (CHA) to probe the separability-entanglement boundary using a supervised learning scheme. To reduce the classification error of CHA, the bagging method [22] was invoked; the resulting method is known as bagging CHA (BCHA). Bagging increases the speed and accuracy of data manipulation, as it divides the whole process into smaller units that run in parallel. Ref. [1] demonstrated results for two-qubit and two-qutrit systems with fairly high accuracy.

In this work, building on the approach of Ref. [1], we propose an alternative method that addresses some important issues and further improves accuracy for ML-based separability-entanglement classification. We notice that the earlier work a) does not address the issue of data imbalance, and b) does not explore all extant performance measures. In what follows, we find that some performance measures are particularly relevant to the study of the separability problem. We also show that a proper ML classifier with boosting can handle the class imbalance issue by optimally balancing between the bagging and boosting methods.

2 Setting up the stage

2.1 Supervised learning

Supervised learning is a method of developing artificial intelligence in which a computer algorithm is trained on input data that has been labeled with the desired output [23]. The model is trained until it discovers the underlying patterns and relationships between the input data and the output labels, allowing it to produce accurate classification results when applied to new data.

In supervised learning, the system is supplied with labeled datasets throughout its training phase, which tell it what output is associated with each specific input [24]. The trained model is then evaluated on test data, i.e., labeled data whose labels are hidden from the algorithm, to determine how well it performs the classification task [25].

To create the learning dataset, we consider bipartite quantum states \(\rho _{AB}\) of dimension \(d_A\times d_B\) in \(\mathcal {H}_A\otimes \mathcal {H}_B\). Since \(\rho ^\dagger =\rho \) and \(\textrm{Tr}[\rho ]=1\), an arbitrary density matrix \(\rho _{AB}\) \(\in \mathcal {H}_A\otimes \mathcal {H}_B\) can be represented by a real vector \({\varvec{x}_i}\) \(\in \mathcal {V}\) (\(=\mathbb {R}^{d_A^2d_B^2-1}\)). We call such a vector the feature vector [see Appendix A for details]. The training dataset is then defined as \(\Omega _\textrm{train}=\{(\varvec{x}_i,y_i)|i=1,\dots ,n\}\), where \(\varvec{x}_i\) is the \(i^{th}\) sample and \(y_i\) its corresponding class label: \(y_i=1(0)\) if the state is separable (entangled). Data labeling for \(d_Ad_B\le 6\) is performed using the PPT criterion; for higher dimensions, the labeling is done as per Appendix C of Ref. [1].
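As a concrete illustration, below is a minimal sketch of one standard parameterization, the generalized Gell-Mann expansion; the exact construction of Appendix A may differ, so treat this as an assumption rather than the paper's own code:

```python
import numpy as np

def gell_mann_basis(d):
    """Orthogonal, traceless, Hermitian basis of d x d matrices (d^2 - 1 elements)."""
    basis = []
    for j in range(d):                         # off-diagonal (symmetric/antisymmetric)
        for k in range(j + 1, d):
            sym = np.zeros((d, d), complex); sym[j, k] = sym[k, j] = 1.0
            asym = np.zeros((d, d), complex); asym[j, k] = -1j; asym[k, j] = 1j
            basis += [sym, asym]
    for l in range(1, d):                      # diagonal generators
        diag = np.zeros((d, d), complex)
        diag[np.diag_indices(d)] = [1.0] * l + [-float(l)] + [0.0] * (d - l - 1)
        basis.append(np.sqrt(2.0 / (l * (l + 1))) * diag)
    return basis

def feature_vector(rho):
    """Map a density matrix to a real vector in R^{d^2 - 1}."""
    d = rho.shape[0]
    return np.array([np.trace(rho @ G).real for G in gell_mann_basis(d)])
```

For a two-qubit state (d = 4) this yields the 15-dimensional vector, and for two qutrits (d = 9) the 80-dimensional one, matching \(d_A^2d_B^2-1\).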

In supervised learning, the main aim is to find a classifier (indicator function) \(\Theta : \mathcal {V}\rightarrow \{0,1\}\) which best fits the training data among a class of functions \(\mathcal {F}\). As the present quantum entanglement problem is a binary classification problem, the error expresses the misclassification rate over the two classes. For any training data \(\Omega _\textrm{train}\) consisting of n samples, each associated with a feature vector \(\varvec{x}_i\in \mathcal {V}\) and a target class label \(y_i\) (\(\in \{0,1\}\)), the loss function \(\mathbb {L}\) for any binary classifier \(\Theta \) can be represented as

$$\begin{aligned} \mathbb {L}(\Theta ,\Omega _\textrm{train})=\frac{1}{n}\sum _{i=1}^n \mathbbm {1}[y_i \ne \Theta (\varvec{x}_i)], \end{aligned}$$

where \(\mathbbm {1}[\cdot ]\) is a truth function of its argument. For any test data \(\Omega _\textrm{test}\), the value of function \(\mathbb {L}(\Theta ,\Omega _\textrm{test})\) depicts the generalization error from \(\Omega _\textrm{train}\) to \(\Omega _\textrm{test}\).
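In code, this loss is simply the misclassification fraction; a minimal sketch:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Empirical 0-1 loss: fraction of samples with y_i != Theta(x_i)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))
```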

It was found that numerous extant supervised learning algorithms, e.g., support vector machines (SVM) [26], decision trees [27], boosting [28], etc., do not provide acceptable accuracy for the separability problem [1]. This is due to the complex structure of the set of separable states, and it led the authors of Ref. [1] to the following consideration.

2.2 Combining CHA with supervised learning

The set of all separable states, \(\Omega _1\), is convex and compact, and its extreme points are pure product states. Using this fact, one can approximate \(\Omega _1\) by the convex hull (\(\mathbb {C}\)) of m pure product states \(\{\varvec{c}_i\}\subset \mathcal {V}\), i.e., \(\mathbb {C}:=\textrm{conv}\{\varvec{c}_i~|i=1,\dots , m\}\). The set \(\mathbb {C}\) is the CHA of \(\Omega _1\), and one can decide whether an unknown state \(\rho \) is separable by examining whether its feature vector \(\varvec{x}\) lies in \(\mathbb {C}\). Equivalently, this is the solution of the following linear program:

$$\begin{aligned}&\max ~ \alpha ~~~ \mathrm{s.t.} ~~~\alpha \varvec{x} \in \mathbb {C},~~~\mathrm{i.e.},\nonumber \\&\alpha \varvec{x} = \sum _{i=1}^{|\mathbb {C}|}\lambda _i\varvec{c}_i, ~~ \lambda _i\ge 0, ~~\sum _{i}\lambda _i=1, \end{aligned}$$
(1)

where \(\alpha \) depends on both \(\mathbb {C}\) and \(\varvec{x}\). If \(\varvec{x}\) is in \(\mathbb {C}\), then the corresponding state \(\rho \) is separable; otherwise \(\rho \) is entangled with high probability. More specifically, \(\rho \) is declared separable when \(\alpha \ge 1\) and entangled otherwise. We denote the maximal \(\alpha \) for a chosen m as \(\alpha _{\max }^m\). Increasing m makes \(\mathbb {C}\) a better approximation of \(\Omega _1\) and therefore yields better classification; adding more extreme points to the convex approximation evidently increases the accuracy of the above algorithm, but it is computationally expensive. To overcome this, Ref. [1] used CHA in combination with supervised learning. The training data are now defined as \(\Omega _\textrm{train}=\{(\varvec{x}_i,\alpha _i,y_i)|i=1,\dots , n\}\) and the loss function of classifier \(\Theta \) is redefined as

$$\begin{aligned} \mathbb {L}(\Theta ,\Omega _\textrm{train})=\frac{1}{n}\sum _{i=1}^n \mathbbm {1}[y_i \ne \Theta (\varvec{x}_i,\alpha _i)]. \end{aligned}$$
(2)

where \(\alpha _i\) is the outcome of CHA for the i-th random density matrix, obtained by solving the linear program that tests whether \(\varvec{x}_i\) lies in \(\mathbb {C}\). Note that CHA uses the threshold \(\alpha \ge 1\) to assign the label 1 (0). The value of \(\alpha \) acts as an additional feature from which the classifier learns the model. In Ref. [1], bagging-based classification is performed on this feature space; the result is known as bagging CHA (BCHA). The bagging and boosting approaches are discussed further below.
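The linear program of Eq. (1) is compact enough to sketch directly. Below is a minimal illustration using scipy.optimize.linprog; the solver choice and the matrix C of sampled product-state feature vectors are our assumptions, as Ref. [1] does not specify an implementation:

```python
import numpy as np
from scipy.optimize import linprog

def cha_alpha(x, C):
    """Eq. (1): max alpha s.t. alpha*x = C @ lam, lam >= 0, sum(lam) = 1.

    x : (D,) feature vector of the state under test (D = d^2 - 1)
    C : (D, m) matrix whose columns are feature vectors c_i of pure product states
    """
    D, m = C.shape
    # Decision variables z = [lam_1, ..., lam_m, alpha]; linprog minimizes,
    # so we minimize -alpha in order to maximize alpha.
    cost = np.zeros(m + 1)
    cost[-1] = -1.0
    # Equalities: C @ lam - alpha * x = 0, and sum(lam) = 1.
    A_eq = np.zeros((D + 1, m + 1))
    A_eq[:D, :m] = C
    A_eq[:D, -1] = -x
    A_eq[D, :m] = 1.0
    b_eq = np.zeros(D + 1); b_eq[D] = 1.0
    bounds = [(0, None)] * (m + 1)             # lam_i >= 0, alpha >= 0
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[-1] if res.success else None  # alpha_max^m; >= 1 => separable
```

For the maximally mixed state (\(\varvec{x}=0\)) the program is unbounded; in practice, the sampled states avoid this corner case.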

2.3 Overview of bagging and boosting classifiers

A bagging classifier is an ensemble meta-estimator that fits base classifiers independently on random subsets of the original dataset and then aggregates the individual predictions (either by voting or by averaging) to produce a final prediction. By adding randomization to the process of building a black-box estimator (such as a decision tree), a meta-estimator of this kind can often be used to lower the variance of the estimator.

A training set is created by randomly selecting M instances (or pieces of data), with replacement, from the original training dataset (of size N) and is used to train each base classifier in parallel. Each base classifier's training set is drawn independently of the others; in the resulting training set, some of the original data may be replicated while others may be absent. An overview of the bagging classifier is presented in Fig. 1.

Fig. 1 Overview of the bagging classifier: multiple learners are created by resampling the data. The new data points are drawn randomly with uniform probability, with replacement. The N learners are trained in parallel, and their errors are averaged to obtain the final learning error \(e=\frac{1}{N}\sum _{i=1}^{N} e_i\)

Boosting is a broad ensemble approach in which a number of weak classifiers are combined to produce a strong classifier. To do this, a first model is constructed from the training data, and a second model is then built in an effort to fix the errors of the first. Models are added until the training set is predicted exactly or a predetermined maximum number of models is reached, whichever comes first. AdaBoost [29] was the first really successful boosting algorithm developed for binary classification. An overview of the boosting classifier is presented in Fig. 2.

Fig. 2 Overview of the boosting classifier: like bagging, boosting also generates multiple resampled datasets. But, unlike the parallel learners in bagging, boosting sequentially learns from the errors of the previous learner: misclassified data are assigned higher weights, and random sampling with weighted replacement is carried out. Another set of weights, assigned to the learners themselves, is accumulated to obtain the final weighted-average error \(e=\sum _{i=1}^{N} w_i e_i\)

Both boosting and bagging fall under the category of "ensemble learning", which combines many weak learners into a single strong classification system. Most often, the weak learners are shallow decision trees.
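A minimal sketch of the two strategies with scikit-learn (version 1.2 or later, where the base-learner keyword is named estimator) may make the contrast concrete; the toy data here merely stand in for our feature vectors:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced data; in our setting X would hold the feature vectors x_i
# (optionally with alpha appended) and y the separable/entangled labels.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)

stump = DecisionTreeClassifier(max_depth=1)
# Bagging: base learners fit in parallel on bootstrap resamples, then vote.
bagger = BaggingClassifier(estimator=stump, n_estimators=100).fit(X, y)
# Boosting: learners fit sequentially, upweighting misclassified samples.
booster = AdaBoostClassifier(estimator=stump, n_estimators=100).fit(X, y)
```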

2.4 Imbalanced dataset

An imbalanced dataset is one with an unequal distribution of class samples. Such an unequal distribution reduces the training performance of a classifier, and hence the classification results on the testing data are also affected.

In the present context, the volume of entangled states is far larger than that of separable states, making the dataset imbalanced. For more details on the experimented datasets, see Sect. 4.1, from which we can observe that the prevalence differences are high and hence both datasets are imbalanced, the two-qubit dataset highly so.

This demands a classifier that can handle the data imbalance issue and is thus better suited to the quantum separability-entanglement classification problem; such a classifier is discussed in the next section.

Also, for such imbalanced datasets, the learning performance of any ML approach is greatly affected [30] and needs a careful performance evaluation. Such performance measures are discussed in Sect. 4.2.

2.5 Ensemble classifiers for imbalanced dataset

It has been well studied that, for imbalanced data, the SVM classifier may be biased toward the majority class [31]. A modification of SVM incorporating random under-sampling (RUS), which randomly removes samples from the training set, has already been presented for unbalanced datasets [32]. For highly unbalanced data, the synthetic minority oversampling technique (SMOTE) [33, 34] has been applied to classification; it over-samples the minority class by creating synthetic data points, and incorporating SMOTE into a boosting approach may therefore be effective. However, when oversampling is performed by duplicating examples, it may lead to over-fitting [35], so a further modification incorporating under-sampling may help improve classifier performance. Instead of over-sampling the minority class, under-sampling the majority class may also improve the results: RUS randomly removes examples from the majority class until the desired class distribution is reached [36]. The integration of such under-sampling with boosting is RUSBoost [36], a hybrid approach combining random under-sampling with the adaptive boosting (AdaBoost) classifier.

For ensemble learning, bagging and boosting are generally applied (see Figs. 1 and 2). The bagging-based CHA (BCHA) has already been proposed [1], reporting higher accuracy than CHA. But, as the data are highly unbalanced, the accuracy evaluation should be twofold: 1) overall accuracy (OA) and 2) average accuracy (AA); for more details on these measures, see Sect. 4.2. OA is the number of correctly classified test samples divided by the total number of test samples, while AA is the mean of the per-class accuracies. Hence, although the reported OA of BCHA is higher [1], we evaluated the AA of BCHA and found it to fall below that of the CHA approach. This demands a further improved classifier that can take care of both OA and AA for separability-entanglement classification.

As the experimented dataset is highly unbalanced (refer to Sect. 4.1), the RUSBoost approach is explored for separability-entanglement classification and is validated against the state-of-the-art approaches. The subsequent section describes the RUSBoost-ensembled CHA classifier.

3 RUSBoost CHA (RUSBCHA)

Initially, all examples in the training dataset are assigned equal weights. During each iteration of AdaBoost, a weak hypothesis is formed by the base learner. The error associated with the hypothesis is calculated, and the weight of each example is adjusted such that wrongly classified examples have their weights increased while correctly classified examples have their weights decreased. Subsequent boosting iterations therefore generate hypotheses that are more likely to correctly classify the previously misclassified examples. After all iterations are completed, a weighted vote of all hypotheses is used to assign a class to the unlabeled samples.

Data sampling techniques attempt to alleviate the problem of class imbalance by adjusting the class distribution of the training dataset. This can be accomplished by either removing examples from the majority class (under-sampling) or adding examples to the minority class (oversampling).

SMOTE adds new artificial minority examples by interpolating between preexisting minority instances rather than simply duplicating original examples. The newly created instances make the minority regions of the feature space fuller and more general.
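The interpolation at the heart of SMOTE fits in a few lines; a conceptual sketch of ours (omitting the nearest-neighbor search that selects the neighbor):

```python
import numpy as np

def smote_point(x, neighbor, rng):
    """Synthesize a minority sample on the segment between a minority
    instance x and one of its minority-class nearest neighbors."""
    gap = rng.uniform(0.0, 1.0)      # random position along the segment
    return x + gap * (neighbor - x)
```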

RUSBoost takes advantage of these data-sampling and boosting approaches by combining them. A detailed discussion of the RUSBoost approach can be found in [36].
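Since neither Ref. [36] nor the present work fixes an implementation, a readily available option is the RUSBoostClassifier of the imbalanced-learn package (version 0.12 or later, where the keyword is named estimator). A minimal RUSBCHA-style sketch on toy data, with feature sizes and labels chosen purely for illustration:

```python
import numpy as np
from imblearn.ensemble import RUSBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins: X plays the role of the feature vectors x_i and alpha
# that of the CHA outputs alpha_max^m (both assumptions, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))                  # d^2 - 1 = 15 for two qubits
alpha = rng.uniform(0.0, 1.2, size=(1000, 1))
y = (alpha[:, 0] >= 1.0).astype(int)             # ~17% "separable": imbalanced

X_aug = np.hstack([X, alpha])                    # the d^2-dimensional feature space
clf = RUSBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200, random_state=0).fit(X_aug, y)
```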

Although significant classifier performance improvement over standalone CHA is observed for BCHA [1], some limitations exist, as discussed in Sect. 1. The approach can be further improved in two ways: 1) by replacing the classifier, and 2) by enlarging the feature space through a proper feature extraction technique. Presently, the first option is explored by incorporating the RUSBCHA classifier for possible improvement in the classification results, leaving the feature extraction techniques as future work.

4 Experimental setup

All classifications were carried out on two kinds of feature spaces: 1) the vector representation of \(\rho \) (a \(d^2-1\)-dimensional feature space), and 2) the vector representation of \(\rho \) together with the CHA-calculated \(\alpha _{\max }^m\) for a specific m (a \(d^2\)-dimensional feature space). The experiments are carried out for both the two-qubit and two-qutrit systems. Five different techniques were tested: bagging and boosting on the raw \(d^2-1\)-dimensional feature vector \(\varvec{x}\) (with d=4 for the two-qubit system and d=9 for the two-qutrit system), CHA with only \(\alpha _{\max }^m\), and BCHA and RUSBCHA trained with both \(\varvec{x}\) and \(\alpha _{\max }^m\). Their associated feature spaces are presented in Table 1.

Table 1 Various experimented classifiers with their associated feature space (dimensions)

The dataset details and the performance evaluators are presented below.

4.1 Dataset preparation

The total data space \(\Omega \) is a combination of the separable subspace \(\Omega _{1}\) and entangled subspace \(\Omega _{0}\), such that \(\Omega =\Omega _{1}\cup \Omega _{0}\) and \(\Omega _{1}\cap \Omega _{0}=\emptyset \) (see Fig. 3). Two datasets, representing the feature vectors of random density matrices for the two-qubit and two-qutrit systems, respectively, are supplied with their class labels in [37]. The procedure for creating the random separable and entangled states can be found in the BCHA manuscript [1]. The total and class-specific training and testing sample counts for the two experimented datasets, namely the two-qubit and two-qutrit systems, are presented in Tables 2 and 3, respectively. Approximately 50% of the samples are randomly selected for training and the remaining 50% are used for testing to evaluate the performance of the ML algorithms.

The maximized parameter \(\alpha _{\max }^m\) for CHA (with varying m) was also obtained from [1, 37], for 1) the two-qubit system with \(m=[1000,2000,\dots ,10000]\), and 2) the two-qutrit system with \(m=[10000,20000,\dots ,100000]\). The maximization was performed by solving the linear program defined in Eq. (1).
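For readers who wish to regenerate comparable data, a standard way to draw random mixed states is the Ginibre (Hilbert-Schmidt) ensemble; this is an assumption of ours, the exact sampling procedure being that of Ref. [1]:

```python
import numpy as np

def random_density_matrix(d, rng):
    """Random d x d state from the Hilbert-Schmidt (Ginibre) ensemble."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T                 # positive semidefinite by construction
    return rho / np.trace(rho)           # normalize to unit trace
```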

Fig. 3 Data space \(\Omega \) as a combination of entangled \(\Omega _{0}\) and separable \(\Omega _{1}\) subspaces. \(c_i\) represents the pure product states

Table 2 Dataset description of experimented training, testing, and total samples for two-qubit systems
Table 3 Dataset description of experimented training, testing, and total samples for two-qutrit systems

From Tables 2 and 3, we can observe that the class samples are unequally distributed within each dataset. For binary classification, the prevalence difference represents the degree of imbalance in the dataset. The dataset-specific prevalence differences of the class samples are:

  • Two-qubit dataset (Table 2): \(\left| \frac{2814}{40000} - \frac{37186}{40000}\right| =0.8593\).

  • Two-qutrit dataset (Table 3): \(\left| \frac{6751}{20000} - \frac{13249}{20000}\right| =0.3249\).

For a balanced dataset, the prevalence difference approaches 0. Here, the prevalence difference for the two-qubit dataset is high (0.86), while for the two-qutrit dataset it is comparatively low (0.32). This clearly signifies that the experimented datasets are imbalanced, highly so in the two-qubit case. For such imbalanced datasets, the learning performance of any ML approach is greatly affected [30], demanding a careful performance evaluation. Such performance measures are discussed further.
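The prevalence difference is a one-line computation; the snippet below reproduces the two figures above:

```python
def prevalence_difference(n_separable, n_entangled):
    """Absolute difference between the two class prevalences."""
    total = n_separable + n_entangled
    return abs(n_separable - n_entangled) / total

print(prevalence_difference(2814, 37186))   # 0.8593 (two-qubit dataset)
print(prevalence_difference(6751, 13249))   # 0.3249 (two-qutrit dataset)
```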

4.2 Performance measures

For ease of understanding the binary classification, the confusion matrix is presented in Fig. 4. In the figure, the columns represent the original class labels (supplied with the data) as true and false; similarly, each row represents the outcome of the classifier.

Fig. 4 Confusion matrix for binary classification

True positives (TP) and true negatives (TN) are the cases where the original (ground-truth) and the obtained (classified) class labels agree, being true and false, respectively. The disagreements are the false positives (FP) and false negatives (FN), which sit off-diagonal in the confusion matrix. Let N be the number of tested samples, i.e., \(N=\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}\). Higher TP and TN values lead to better accuracy; on the contrary, higher FP and FN values disqualify the classifier.

Now we can define overall accuracy (OA) as

$$\begin{aligned} {\text {OA}} = \frac{{{\text {TP}} + {\text {TN}}}}{N}, \end{aligned}$$

and the overall error (OE) as OE=1-OA.

For binary classification, let, out of the N tested samples, \(N_1\) and \(N_2\) samples be labeled as true and false, respectively (so \(N=N_1+N_2\)). The average accuracy (AA) is the mean of the per-class accuracies and is defined as

$$\begin{aligned} {\text {AA}} = \frac{1}{2}\left( {\frac{{{\text {TP}}}}{{N_{1} }} + \frac{{{\text {TN}}}}{{N_{2} }}} \right) , \end{aligned}$$

and the average error (AE) as AE=1-AA.

Similarly, other important measures such as sensitivity (\({s}=\frac{\hbox {TP}}{\hbox {TP}+\hbox {FN}}\)), specificity (\(r=\frac{\textrm{TN}}{\mathrm{TN+FP}}\)), precision (\(k= \frac{\textrm{TP}}{\mathrm{TP+FP}}\)), the F-measure, and the G-mean can be incorporated to validate the classification results. We will use the following two in our analysis:

$$\begin{aligned} \text{ F-measure } = 2\left( \frac{k \times s}{k + s}\right) ,\,\,\text{ and }\,\, \text{ G-mean } = \sqrt{s \times r}. \end{aligned}$$

Higher values of OA, AA, F-measure, and G-mean are desirable for evaluating the performance of a classifier.
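All of these measures follow directly from the confusion-matrix counts; a compact reference implementation (our sketch, not code from Ref. [1]):

```python
import math

def measures(tp, tn, fp, fn):
    """OA, AA, F-measure, and G-mean from confusion-matrix counts."""
    n = tp + tn + fp + fn
    n1, n2 = tp + fn, tn + fp          # actual true- and false-class sizes
    s = tp / (tp + fn)                 # sensitivity
    r = tn / (tn + fp)                 # specificity
    k = tp / (tp + fp)                 # precision
    return {"OA": (tp + tn) / n,
            "AA": 0.5 * (tp / n1 + tn / n2),
            "F-measure": 2 * k * s / (k + s),
            "G-mean": math.sqrt(s * r)}
```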

5 Results and discussion

We used both datasets (see Sect. 4.1) and all the performance measures described in Sect. 4.2 to compare the proposed RUSBCHA with other state-of-the-art classifiers; the comparisons are presented as figures. For a robust representation of performance on the experimented data, all classification performance measures are averaged over 30 independent evaluations.

The bagging and boosting classifiers incorporate only the \(d^2-1\)-dimensional feature vector \(\varvec{x}\). The classification performances (AE, F-measure, G-mean, and OE) for the two-qubit and two-qutrit systems are presented in Fig. 5a, b, respectively. For the two-qubit system (Fig. 5a), the boosting approach outperforms the bagging approach in terms of F-measure, G-mean, and AE, while only a marginal deviation is observed for OE. Similarly, for the two-qutrit system (Fig. 5b), improvement is observed for G-mean and AE.

Fig. 5 Classification results of the raw data without considering the CHA (\(\alpha \)) for a two-qubit and b two-qutrit system

According to both the CHA and BCHA approaches, if \(\alpha _{\max }^m \ge 1\), \(\varvec{x}\) is separable; otherwise, \(\varvec{x}\) is highly likely to be an entangled state. Hence, our proposed RUSBCHA classifier also incorporates both the feature vector \(\varvec{x}\) and \(\alpha _{\max }^m\). To find the trade-off between the state-of-the-art BCHA and the proposed RUSBCHA approach, further experiments are performed on both the two-qubit and two-qutrit datasets. These experiments include:

  • Experiment 1: Performance evaluation of classifiers over varying m.

  • Experiment 2: Performance evaluation of classifiers over varying percentages of training and testing samples.

  • Experiment 3: Performance evaluation of classifiers on varying prevalence difference of dataset.

5.1 Experiment 1

In this experiment, the CHA, BCHA, and proposed RUSBCHA classifiers are compared over varying m for both two-qubit and two-qutrit datasets. Experimental results are shown in Figs. 6 and  7.

For the two-qubit system, Fig. 6b shows that the AE of BCHA is higher for all values of m compared to the CHA and RUSBCHA approaches; BCHA reaches almost 40% error for low values of m. It can also be observed that, for low values of m, the performances of CHA and RUSBCHA are similar, while for high values of m RUSBCHA attains lower AE values. This clearly signifies that the proposed RUSBCHA is less biased toward the majority class, and hence its average accuracy is higher in comparison with the other state-of-the-art approaches. A similar interpretation applies to Fig. 6d.

From Fig. 6a, it can be observed that the OE of BCHA has lower values, and hence its performance is better for low values of m in comparison with the RUSBCHA and CHA approaches, while the proposed RUSBCHA shows intermediate performance. In Fig. 6c, however, the F-measure performances are similar for all approaches.

Fig. 6 Classification results of the two-qubit system, considering the CHA (\(\alpha \))

On the other hand, for the two-qutrit system (Fig. 7), both BCHA and RUSBCHA perform similarly over varying m, with significant performance improvements as compared to the state-of-the-art CHA approach.

Fig. 7 Classification results of the two-qutrit system, considering the CHA (\(\alpha \))

In this experiment, we observe better performance of the proposed RUSBCHA approach on the two-qubit dataset in comparison with the BCHA and CHA approaches, while RUSBCHA and BCHA perform similarly on the two-qutrit dataset. To find the rationale for the performance difference between these two datasets, further experiments are carried out.

5.2 Experiment 2

It is well established in the literature that several machine learning techniques, such as neural networks and deep learning, require a large number of training samples; classifier performance may thus be sensitive to the percentage of training samples. In Experiment 1, 50% of the samples were used for training and the rest for testing. Hence, the approaches are further validated with varying training (10-50%) and testing (90-50%) fractions, and the performances are presented in Figs. 8 and 9 for the two-qubit and two-qutrit systems, respectively. Note that, for this experiment, the total samples are the same as in Tables 2 and 3 for the respective datasets, and m is set to 2000 and 20000 for the two-qubit and two-qutrit data, respectively.
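The split schedule of this experiment can be sketched as follows; the stratified splitting and the toy data are our assumptions, as the paper only fixes the percentages:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data mimicking the ~7% separable prevalence of the two-qubit dataset.
X, y = make_classification(n_samples=40000, weights=[0.93], random_state=0)
for train_frac in (0.1, 0.2, 0.3, 0.4, 0.5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_frac, stratify=y, random_state=0)
    # fit BCHA / RUSBCHA on (X_tr, y_tr); record OA and AA on (X_te, y_te)
```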

From Fig. 8a, it can be observed that the OA of BCHA is about 2.5% higher than that of RUSBCHA, while in Fig. 8b the AA of RUSBCHA is more than 15% higher than that of BCHA. However, the results of these classifiers do not vary with the training percentage; the performance of both classifiers is therefore not sensitive to the number of training samples. Similar results are observed for the two-qutrit data in Fig. 9a and b. The AA performances in Figs. 8b and 9b suggest that RUSBCHA performs better than BCHA, specifically for the two-qubit dataset. Note in this respect that the prevalence difference of the two-qutrit dataset (0.3249) is comparatively low relative to that of the two-qubit dataset (0.8593). This suggests that further experiments testing both classifiers over varying prevalence differences might provide a clue as to how these classifiers behave on imbalanced datasets.

Fig. 8 Obtained overall accuracy and average accuracy for the two-qubit system over varying percentage (%) of training samples (m=2000)

Fig. 9 Obtained overall accuracy and average accuracy for the two-qutrit system over varying percentage (%) of training samples (m=20000)

5.3 Experiment 3

The above experiments were performed with the two-qubit and two-qutrit datasets of Table 2 and Table 3, respectively. From these tables, one can observe that the separable samples constitute only 7% and 33% of the total samples for the two-qubit and two-qutrit datasets, respectively. To test the performance of the classifiers under different prevalence differences, we created imbalanced datasets with varying prevalence differences for both systems.

Table 4 Description of imbalanced datasets created from the original two-qubit dataset of Table 2

Table 4 describes the imbalanced datasets created for the two-qubit system. Each row of the table describes a dataset that is a subset of the dataset described in Table 2, listing its numbers of separable, entangled, and total samples together with the prevalence difference of the respective dataset. The prevalence difference values range approximately from 0 to 0.9, where 0 represents a balanced dataset and 0.9 a highly imbalanced one. A similar interpretation holds for the two-qutrit datasets in Table 5.

Table 5 Description of imbalanced datasets created from the original two-qutrit dataset of Table 3

Figure 10 shows the classifier performances over the varying prevalence difference of the two-qubit data. The performances are averaged over 30 iterations; in each iteration, a new subset of the dataset is created with the prescribed prevalence difference (Table 4). For this experiment, we fixed m=2000 and used 50% training samples.
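One way to realize such subsets (our sketch; the sampling code is not given in the paper) is to under-sample the majority class until a target prevalence difference is reached:

```python
import numpy as np

def subsample_to_prevalence(X, y, target_pd, rng):
    """Subset with prevalence difference ~ target_pd, keeping all minority
    samples (label 1) and under-sampling the majority class (label 0).

    With minority fraction p, the prevalence difference is 1 - 2p,
    hence p = (1 - target_pd) / 2; assumes enough majority samples exist.
    """
    p = (1.0 - target_pd) / 2.0
    idx_min, idx_maj = np.where(y == 1)[0], np.where(y == 0)[0]
    n_maj = int(round(len(idx_min) * (1.0 - p) / p))
    keep = np.concatenate([idx_min, rng.choice(idx_maj, n_maj, replace=False)])
    return X[keep], y[keep]
```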

It is observed from Fig. 10a that the OA of BCHA and RUSBCHA are similar up to a prevalence difference of 0.6; beyond that, there is a minor improvement in OA for the BCHA approach in comparison with RUSBCHA. From Fig. 10b, it can be observed that the AA of both classifiers is similar up to a prevalence difference of 0.5; beyond that, the AA of BCHA declines sharply in comparison with RUSBCHA.

Fig. 10 Overall accuracy and average accuracy of BCHA and RUSBCHA over varying prevalence difference of the two-qubit data

Figure 11 shows the classifier performances over the varying prevalence difference of the two-qutrit data. The performances are again averaged over 30 iterations, with a new subset created in each iteration (Table 5). For this experiment, we fixed m=20000 and used 50% training samples.

It is observed from Fig. 11a that the OA of BCHA and RUSBCHA are similar up to a prevalence difference of 0.3; beyond that, there is a minor improvement in OA for the BCHA approach in comparison with RUSBCHA. From Fig. 11b, it can be observed that the AA of both classifiers is similar up to a prevalence difference of 0.25; beyond that, the AA of BCHA declines sharply in comparison with RUSBCHA.

Fig. 11 Overall accuracy and average accuracy of BCHA and RUSBCHA over varying prevalence difference of the two-qutrit data

From the results in Figs. 10 and 11, it can be observed that the performance of the proposed RUSBCHA approach is consistent (almost a straight line) over varying prevalence differences of the data. We can therefore conclude that the performance of RUSBCHA is not heavily affected by data imbalance.

Referring to our earlier observations, the good AA of the proposed RUSBCHA relative to BCHA in Fig. 6, and the similar performances of RUSBCHA and BCHA in Fig. 7, can now be justified using Fig. 10 and Fig. 11, respectively. Since the prevalence difference of the two-qubit data is 0.8593, our proposed RUSBCHA performs better than BCHA; since the prevalence difference of the two-qutrit data is only 0.3249, both BCHA and RUSBCHA perform similarly.

Hence, we conclude that RUSBCHA is an alternative to the BCHA approach and a better classifier for dealing with highly imbalanced datasets. Overall, ensemble learning is helpful for a better understanding of the separability-entanglement problem when compared to the stand-alone CHA approach.

6 Conclusion

The necessity of a separability-entanglement classifier is well known in the quantum information community. Although various criteria like PPT, necessary and sufficient only in low dimensions, have been proposed in the past, they cannot be generalized to higher dimensions. ML approaches are vastly exploited in the general data-mining perspective, while their discussion and application remain limited in quantum information processing. Similar to BCHA, we proposed RUSBCHA as an alternative ML-based solution for the quantum separability problem. The proposed RUSBCHA approach shows improvements in AE for the two-qubit system, while responding similarly to CHA for the two-qutrit system. As the data are highly unbalanced, standard performance measures such as OE, AE, F-measure, and G-mean are evaluated. The results suggest incorporating a proper ML approach, together with proper performance measures, for separability-entanglement classification. Moreover, the proposed RUSBCHA is an alternative to CHA that can deal with unbalanced datasets and may reduce the over-fitting error of the classifier.

To isolate the effectiveness of the classifier, feature extraction is left unexploited here; exploring it is a further direction of research to improve the classification performance. Other ML approaches can also be exploited and validated in future work.