SEindicator
Preprocessing of near-infrared spectra to remove unwanted, i.e., non-related spectral variation and selection of informative wavelengths is considered to be a crucial step prior to the construction of a quantitative calibration model. The standard methodology when comparing various preprocessing techniques and selecting different wavelengths is to compare prediction statistics computed with an independent set of data not used to make the actual calibration model. When the errors of reference value are large, no such values are available at all, or only a limited number of samples are available, other methods exist to evaluate the preprocessing method and wavelength selection. In this work we present a new indicator (SE) that only requires blank sample spectra, i.e., spectra of samples that are mixtures of the interfering constituents (everything except the analyte), a pure analyte spectrum, or alternatively, a sample spectrum where the analyte is present. The indicator is based on computing the net analyte signal of the analyte and the total error, i.e., instrumental noise and bias. By comparing the indicator values when different preprocessing techniques and wavelength selections are applied to the spectra, the optimal preprocessing technique and the optimal wavelength selection can be determined without knowledge of reference values, i.e., it minimizes the non-related spectral variation. The SE indicator is compared to two other indicators that also use net analyte signal computations. To demonstrate the feasibility of the SE indicator, two near-infrared spectral data sets from the pharmaceutical industry were used, i.e., diffuse reflectance spectra of powder samples and transmission spectra of tablets. Especially in pharmaceutical spectroscopic applications, it is expected beforehand that the non-related spectral variation is rather large and it is important to remove it. The indicator gave excellent results with respect to wavelength selection and optimal preprocessing. The SE indicator performs better than the two other indicators, and it is also applicable to other situations where the Beer–Lambert law is valid.

Index Headings: Spectral preprocessing; Wavelength selection; Near-infrared spectroscopy; Error indicator; Net analyte signal; Signal-to-noise ratio; Pharmaceutical powders and tablets.

Received 19 June 2003; accepted 30 October 2003.
* Author to whom correspondence should be sent. E-mail: westerhuis@science.uva.nl.
Applied Spectroscopy, Volume 58, Number 3, 2004. © 2004 Society for Applied Spectroscopy. 0003-7028/04/5803-0264$2.00/0

INTRODUCTION

Near-infrared (NIR) spectroscopy is gaining popularity as a quantitative analytical method in the pharmaceutical industry.¹⁻³ Quality control of incoming raw materials and quantitative analysis of intermediate⁴,⁵ and finalized products³ are examples of that. Spectra can be recorded quickly and in a non-invasive manner and they can be combined with a multivariate calibration technique, e.g., principal component regression⁶ (PCR) and partial least-squares regression⁷ (PLS), whereby quantitative measures can easily be obtained. One known problem in near-infrared spectroscopy is spectral variations that are not related to the property of interest.⁸ This non-related variation is especially important in pharmaceutical applications of NIR. In the pharmaceutical industry, spectra are often recorded in reflectance mode. Varying particle sizes and varying compression of, e.g., powders cause non-related spectral variation. To correct for this variation various spectral preprocessing techniques are used prior to calibration, e.g., multiplicative scatter correction⁹ (MSC), offset correction, or Savitzky–Golay¹⁰ derivatives. Another problem is that if a large part of the recorded spectrum does not contain any information about the analyte, wavelength selection becomes very important. Several methods have been proposed for wavelength selection.¹¹,¹² Until recently, it was believed that full spectrum methods, e.g., PLS, would automatically overcome the problem of wavelength selection by setting the regression coefficients for non-informative wavelengths to zero or near zero. However, this is not the case and PLS-based calibrations can in many cases be improved by a proper selection of wavelengths.¹³

The most common way of judging whether a preprocessing method is beneficial for the analytical performance is to compute the prediction uncertainty for an independent test set, i.e., the root mean square error of prediction (RMSEP) or root mean square error of prediction cross-validation (RMSECV) if only a smaller dataset is available, and then select the preprocessing method that gives the lowest RMSEP/RMSECV. Some pitfalls with this method are that it requires a fairly large number of samples, i.e., both calibration and test set data. Secondly, if the uncertainty of the reference values is high then judgments are based on reference values with errors. Finally, when using PCR or PLS the RMSEP/RMSECV values are influenced by the model dimensionality. If the model dimensionality is not estimated correctly with some kind of validation technique, then the RMSEP/RMSECV values will be misleading and therefore, judgments of preprocessing method selection or wavelength selection may also be incorrect.

Other methods exist to help choose the optimal preprocessing method, i.e., methods using the net analyte signal (NAS) concept. Net analyte signal is defined as the part of a signal that is unique for the analyte of interest.¹⁴ Lorber¹⁴ demonstrated how figures of merit, e.g., multivariate sensitivity, signal-to-noise ratio, selectivity, and limit of detection could be computed from the net analyte signal of the analyte. These figures of merit can be used
to judge whether a preprocessing method is beneficial for the analytical performance, and they can also be used for wavelength selection. Faber¹⁵ used the inverse multivariate sensitivity of the analyte to judge whether a certain preprocessing method, e.g., derivative, would improve the predictive ability of the calibration model or not. Xu and Schechter¹⁶ developed an error indicator for wavelength selection. Boelens et al. have also demonstrated the usability of NAS for improving the detection limit for a spectroscopic process analysis by tuning Savitzky–Golay filters.¹⁷ All these methodologies use the net analyte signal of the analyte of interest.

In this work we introduce a new error indicator called the signal-to-error indicator (SE). A signal-to-error (SE) value is computed for the analyte when various preprocessing methods and wavelength selections are applied to the spectra. The highest SE value indicates the optimal preprocessing and wavelength interval.

We will demonstrate the performance of the inverse sensitivity indicator, the error indicator, and the signal-to-error indicator with two NIR data sets from different stages in a pharmaceutical tablet production process. The indicators are compared to the standard PLS methodology and the RMSECV. For the applications presented in this paper the PLS method is used as a standard to which the other indicators can be compared. This is possible since the reference method is known to be accurate. The first set contains spectra of powder samples after mixing the tablet constituents. In the second data set, finalized tablets using the powder composition from the first data set are measured. In both cases the analyte is the active pharmaceutical ingredient (API) and the optimal preprocessing and wavelength selection is sought.

First some theory about net analyte signal and the method to compute figures of merit will be presented. Secondly, the different error indicators will be described and compared. Then in the Experimental section the instrumentation and different data sets used are described in detail, and finally, in the Results and Discussion section the different error indicators are compared and the results are commented on.

THEORY

Notation. Boldface capital characters denote matrices, boldface lower-case characters denote vectors, and lower-case italic characters denote scalars. $\|\mathbf{r}\|$ is the Euclidean norm of the vector $\mathbf{r}$, superscript T denotes the transposed matrix or vector, and the superscript + denotes the Moore–Penrose generalized inverse of a matrix. The matrix $\mathbf{I}_J$ is the $J \times J$ identity matrix.

Net Analyte Signal. The net analyte signal is defined as the part of a spectrum that is orthogonal to a subspace spanned by the spectra of all constituents except the analyte, i.e., all interfering constituents.¹⁴ So the net analyte signal of analyte k can be found by the following orthogonal projection:

$$\mathbf{r}^*_k = (\mathbf{I}_J - \mathbf{S}_{-k}\mathbf{S}^+_{-k})\,\mathbf{r}_k \tag{1}$$

$$\mathbf{s}^*_k = (\mathbf{I}_J - \mathbf{S}_{-k}\mathbf{S}^+_{-k})\,\mathbf{s}_k \tag{2}$$

where $\mathbf{r}_k$ is a $J \times 1$ vector containing the spectral response for a sample including the analyte k measured at J wavenumbers. The pure analyte spectrum $\mathbf{s}_k$ is a $J \times 1$ vector. $\mathbf{S}_{-k}$ is a $J \times L$ matrix with L spectra of blank samples. In some publications¹⁴ pure spectra of the interfering constituents are used to construct the $\mathbf{S}_{-k}$ matrix. In our experience this is not the best method, e.g., pure spectra are not always available and the pure constituent spectrum might differ slightly in shape from the spectral contribution in a mixture of interfering constituents. Practically, the $\mathbf{S}_{-k}$ matrix is most easily constructed by measuring mixtures of the interfering constituents. $\mathbf{S}^+_{-k}$ is its Moore–Penrose inverse, an $L \times J$ matrix. $\mathbf{r}^*_k$ and $\mathbf{s}^*_k$ are $J \times 1$ vectors called the net analyte signal vector of the kth constituent. The net analyte signal for constituent k in any sample can now be computed with Eq. 1.

Inverse Sensitivity. Various figures of merit¹⁴ can be computed using the net analyte signal concept, e.g., analyte sensitivity. Faber¹⁵ evaluated the effect of various preprocessing methods of near-infrared spectra with an error indicator based on computation of the inverse of the analyte sensitivity ($a^{-1}$; from here we denote this as invSEN) using the net analyte signal concept. Faber used the assumption that the length of the net analyte signal vector is proportional to the concentration of the analyte. Faber converted the net analyte signal vector into a scalar value by taking the Euclidean norm¹⁵ of the net analyte signal vector and plotted the value against the analyte concentration of the sample, thereby constructing a univariate calibration plot. The analyte sensitivity can then be computed with:

$$a = \frac{\|\mathbf{r}^*_{k,c}\|}{c_{k,c}} \tag{3}$$

$$\mathrm{invSEN} = a^{-1} = \frac{c_{k,c}}{\|\mathbf{r}^*_{k,c}\|} \tag{4}$$

where $\|\mathbf{r}^*_{k,c}\|$ is the norm of the net analyte signal of a calibration sample with concentration $c_{k,c}$ and the slope of the calibration line a is the sensitivity of analyte k.

Faber concluded that a preprocessing method is beneficial for the final predictive ability if the inverse sensitivity is decreasing with that particular pretreatment. The effect on the inverse sensitivity when doing first and second derivatives compared to multiplicative scatter corrected (MSC) spectra was evaluated. This indicator needs a collection of spectra to span the interference space and spectra containing the analyte and their respective reference concentrations of the analyte to compute analyte sensitivity.

Error Indicator. Xu and Schechter¹⁶ developed an error indicator (EI) for wavelength selection. The assumption for their EI is that the prediction error in multivariate analysis is determined by the quality of the corresponding net analyte signal. By minimizing the relative error in the norm of the NAS, the analytical conditions are optimized and lower prediction errors are achieved. The EI was defined as follows:

$$\mathrm{EI} = \frac{\left[\mathrm{var}\left(\|\mathbf{r}^*_k\| - \|\mathbf{r}^*_{k,\mathrm{true}}\|\right)\right]^{1/2}}{\|\mathbf{r}^*_{k,\mathrm{true}}\|} \tag{5}$$

Due to non-related variations (interferents or baseline offsets) the norm of the NAS may be affected. The numerator of the EI describes the variance in the norm of the NAS caused by noise in the spectra due to non-related
[Eq. 7, garbled in the source: EI expressed as a square-root term in $\sigma^2$, $J$, and $4\|\mathbf{r}^*_k\|^2$, divided by $\|\mathbf{r}^*_k\|$.]

The standard deviation of the spectral noise, $\sigma$, is found from the net analyte signal regression plot (NASRP). First take the NAS vector of the pure analyte spectrum $\mathbf{s}^*_k$ and the NAS vector of a sample spectrum containing the analyte $\mathbf{r}^*_k$. Then the absorbance at each wavelength j in $\mathbf{s}^*_k$ is plotted against the absorbance in $\mathbf{r}^*_k$ at the same wavelength, for all $j = 1, \ldots, J$ wavelengths in the vectors. This results in the NASRP plot. In the ideal case with no non-related variation, both NAS vectors will point in the same direction and the points in the NASRP plot will form a perfectly straight line passing through (0, 0). The assumption made by Xu and Schechter¹⁶ is that at each wavelength the error is normally distributed with the same standard deviation, i.e., white noise. A straight line is fitted through the points in the NASRP plot in a least-squares sense and by computing the residual vector, i.e., the deviation of each of the points from the line, $\sigma$ can be computed:¹⁸

$$\sigma = \sqrt{\frac{\mathbf{e}^T_{k,\mathrm{res}} \cdot \mathbf{e}_{k,\mathrm{res}}}{J - 1}} \tag{8}$$

where $\mathbf{e}_{k,\mathrm{res}}$ is a $J \times 1$ vector containing the residuals. The residuals are computed in the following manner:

$$\mathbf{e}_{k,\mathrm{res}} = \mathbf{r}^*_k - \mathbf{s}^*_k c_k \tag{9}$$

$$c_k = \mathbf{r}^{*T}_k \cdot \mathbf{s}^*_k / \|\mathbf{s}^*_k\|^2 \tag{10}$$

The error indicator needs a collection of blank spectra to span the interference space, the pure analyte spectrum, and a sample spectrum containing the analyte.

Signal-to-Error Indicator. In this work we present a new indicator based on the computations of the signal-to-error (SE). We assume that the error in the spectra is made of two contributions, i.e., noise and bias. If a certain preprocessing method or wavelength selection is not removing unwanted interference, then extra blank samples may have a small contribution orthogonal to $\mathbf{S}_{-k}$ when they are projected onto the interference space. We compute this contribution as the projection ($\mathrm{PROJ}_{\mathrm{blank}}$) of some extra blank spectra ($\mathbf{r}_{\mathrm{blank}}$) on the normed $\mathbf{s}^*_k$ vector, i.e., normed to unit length. We call the normed $\mathbf{s}^*_k$ vector the net analyte signal regression vector, $\mathbf{nas}_{\mathrm{reg}}$.

$$\mathrm{PROJ}_{\mathrm{blank}} = \mathbf{r}^T_{\mathrm{blank}} \cdot \left[\frac{\mathbf{s}^*_k}{(\mathbf{s}^{*T}_k \mathbf{s}^*_k)^{1/2}}\right] \tag{11}$$

$$\mathrm{PROJ}_{\mathrm{blank}} = \mathbf{r}^T_{\mathrm{blank}} \cdot \mathbf{nas}_{\mathrm{reg}} \tag{12}$$

The error taking into account both bias and noise is computed by:

$$\mathrm{error} = \sqrt{\frac{\sum_{i=1}^{I} (\mathrm{PROJ}_{\mathrm{blank},i} - 0)^2}{I}} = \sqrt{\frac{\sum_{i=1}^{I} (\mathrm{PROJ}_{\mathrm{blank},i})^2}{I}} \tag{13}$$

In the denominator we use I and not I − 1 because no mean is subtracted so the degrees of freedom are preserved.

The signal is then computed by projecting the analyte spectrum on the NAS regression vector and the SE can be computed as the ratio between the signal and the error:

$$\mathrm{Signal} = \mathbf{s}^T_k \cdot \mathbf{nas}_{\mathrm{reg}} \tag{14}$$

$$\mathrm{SE} = \frac{\mathrm{Signal}}{\mathrm{error}} \tag{15}$$

This error indicator needs a collection of blank spectra to span the interference space and to quantify the error part plus the pure analyte spectrum. If the pure analyte spectrum is not available, a sample spectrum containing the analyte can be used.

Although the error indicator and the signal-to-error indicator seem to be comparable, there are some important differences. The EI minimizes the difference between the length of two vectors, $\mathbf{r}^*_{k,\mathrm{true}}$ and $\mathbf{r}^*_k$. However, these vectors will not necessarily point in the same direction. Therefore, the difference in lengths is not directly related to errors in concentration. The SE indicator focuses on errors in the direction of the NAS regression vector, i.e., the same direction. The projections on the NAS regression vector are used (can also be negative) and not only the lengths of the projected vector. These projections are directly related to the concentrations (cf. Fig. 1).

Toolboxes for net analyte signal calibrations are available for free download at http://www-its.chem.uva.nl/research/pac/index.html.
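As a rough illustration of the computations in Eqs. 1, 2, 4, and 8–15, the following NumPy sketch may be helpful. It is not taken from the toolbox mentioned above. The function names (`nas_projector`, `inverse_sensitivity`, `nasrp_sigma`, `signal_to_error`), the synthetic Gaussian bands, the noise level, and the baseline offset are all assumptions made for illustration only.

```python
import numpy as np

def nas_projector(S_blank):
    """I_J - S S^+ for a J x L matrix of blank spectra (columns), Eqs. 1 and 2."""
    J = S_blank.shape[0]
    return np.eye(J) - S_blank @ np.linalg.pinv(S_blank)

def inverse_sensitivity(r_star_c, c_kc):
    """invSEN = c / ||r*|| for one calibration sample, Eq. 4."""
    return c_kc / np.linalg.norm(r_star_c)

def nasrp_sigma(r_star, s_star):
    """Noise standard deviation from the NAS regression plot, Eqs. 8-10."""
    c_k = (r_star @ s_star) / (s_star @ s_star)          # Eq. 10
    e_res = r_star - s_star * c_k                        # Eq. 9
    return np.sqrt(e_res @ e_res / (r_star.size - 1))    # Eq. 8

def signal_to_error(S_blank, s_k, R_extra_blank):
    """SE indicator, Eqs. 11-15. R_extra_blank holds I extra blank spectra as rows."""
    s_star = nas_projector(S_blank) @ s_k                # Eq. 2
    nas_reg = s_star / np.linalg.norm(s_star)            # normed NAS vector
    proj = R_extra_blank @ nas_reg                       # Eq. 12, one value per blank
    error = np.sqrt(np.mean(proj ** 2))                  # Eq. 13, divides by I
    signal = s_k @ nas_reg                               # Eq. 14
    return signal / error                                # Eq. 15

# Synthetic example (assumed data): two interferent bands and one analyte band.
x = np.linspace(0.0, 1.0, 100)

def band(center, width):
    return np.exp(-0.5 * ((x - center) / width) ** 2)

interferents = np.vstack([band(0.3, 0.1), band(0.7, 0.1)])
s_k = band(0.5, 0.05)
rng = np.random.default_rng(0)

def blanks(n, offset=0.0):
    """n blank spectra (rows): random interferent mixtures + offset + white noise."""
    conc = rng.uniform(0.5, 1.5, size=(n, 2))
    return conc @ interferents + offset + rng.normal(0.0, 1e-3, (n, x.size))

S_blank = blanks(3).T                       # J x L matrix spanning the interference space
se_clean = signal_to_error(S_blank, s_k, blanks(20))
se_offset = signal_to_error(S_blank, s_k, blanks(20, offset=0.05))
print(f"SE without baseline offset: {se_clean:.1f}")
print(f"SE with baseline offset:    {se_offset:.1f}")
```

In this toy setting the extra blanks that carry an uncorrected baseline offset give a lower SE value than the offset-free blanks, which is the behavior the indicator relies on when preprocessing methods or wavelength intervals are compared.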
results, five PLS components were selected for the PLS model of the whole wavelength range.

The indicator values and the RMSECV were calculated using the 4000–10 000 cm⁻¹ wavelength region and by applying the preprocessing methods listed in Table I. In Fig. 4 the gain values are depicted for the indicators and RMSECV. The RMSECV shows that the best preprocessing method is first derivatives using 25 spectral points with a gain value of 2.9. The SE indicator has the highest gain for first derivatives, while the EI indicator has the highest gain for second derivatives. The invSEN indicator has the highest gain for MSC, which is clearly wrong compared to the PLS results.

Wavelength Selection for Powder Samples. Indicator and RMSECV values were computed for twenty wavelength intervals around 6000 cm⁻¹. For all intervals, four PLS components were used to calculate the RMSECV values. Again the number of PLS components is based on cross-validation results. This was done for all seven preprocessing methods and the highest gain value for the RMSECV was then found to be 5.95 when preprocessing method 5 was used with the wavelength interval from 5840–6160 cm⁻¹ (Fig. 5). This matched perfectly the SE indicator, which had the highest gain value for the same preprocessing method and wavelength interval as the PLS method. Also, the EI indicator had the highest gain value for preprocessing method 5, but using the wavelength interval from 5760–6240 cm⁻¹. The shape of the RMSECV gain curve corresponded well with the shape of the SE gain curve, and also the gain values were all above one for the RMSECV and the SE. The gain values for the EI when applying preprocessing method 5 were only above one for three intervals, i.e., I-2, I-3, and I-4, while the remaining intervals were less than one, indicating that no preprocessing and using the whole wavelength region was better for those intervals (Fig. 5). The invSEN indicator was not useful for wavelength selection using any of the preprocessing methods. The highest gain value for the invSEN was 11.8 using MSC as the preprocessing method and the wavelength interval from 4000 to 10 000 cm⁻¹, and when using all other preprocessing methods the gain for the invSEN was always below one, with the lowest value for the smallest wavelength interval, i.e., I-1 and increasing with increasing interval width (see insert in Fig. 5).

It is important to notice that the selection of preprocessing method using the whole wavenumber range is not representative of the results when only a small wavelength region is used. Therefore, combining preprocessing and wavelength selection, as is done here, seems to be necessary.

Results for Tablet Samples. To span the interference space for the invSEN, SE, and EI indicators we used three blank sample spectra. To compute the invSEN we used two samples with a high concentration of API. To compute the SE we used two sample spectra, i.e., using two samples with a high concentration of API as substitution for pure analyte tablet spectra that were not available to compute the signal, and an additional three blank sample spectra to compute the error. To compute the EI four sample spectra with a high API concentration were used. Two of the sample spectra were used to compute the average $\mathbf{r}^*_k$ and the two other sample spectra were used to compute the average $\mathbf{s}^*_k$ (Eqs. 2 and 3) because no pure analyte tablet samples are available. The RMSECV values were calculated using all 18 samples. When the RMSECV value was computed the leave-one-out principle was used because of the limited size of the dataset.

Choosing the Optimal Preprocessing Method for Tablet Samples. Also for the tablet samples, comparison of the preprocessing methods using a broad spectral range was not a feasible method, i.e., preprocessing combined with wavelength selection was necessary.

Wavelength Selection for Tablet Samples. Indicator and RMSECV values were computed for fifteen wavelength intervals around 8800 cm⁻¹ with all the preprocessing methods described in Table I. All PLS models were calculated using four PLS components. The highest gain value for the RMSECV was 3.6 when using preprocessing method 5, i.e., first derivatives with 25 spectral points and the wavelength interval 8620–8980 cm⁻¹ (Fig. 6). Also, the SE had the maximum gain value of 3.8 using preprocessing method 5 and the interval 8620–8980 cm⁻¹ (Fig. 6). The shapes of the RMSECV and the SE gain curves were fairly similar. As for the powder samples,
the invSEN was not useful for wavelength selection and the gain values were less than one except for the MSC method. The EI had a maximum gain at 1.29 when MSC was used for preprocessing and the wavelength interval was 8320–9280 cm⁻¹ (not depicted) and was in general not useful for wavelength selection of the tablet samples.

FIG. 6. Optimal preprocessing/wavelength selection. Gain values for indicators and RMSECV for tablet samples.

The problem with the invSEN indicator is that when the spectra are preprocessed using first and second derivatives the Euclidean length of the spectra and subsequently the net analyte signal vectors are lowered. This decreases the analyte sensitivity as computed in Eq. 4 without regard to the analytical performance of a calibration model using derivative spectra. In the original publication, Faber assumed that only white noise is present, which is a huge simplification of real spectroscopic systems in pharmaceutical applications. This might also explain why the method fails with our examples.

The EI indicator performed reasonably well but with failures. Wavelength selection of the tablet samples was not possible. The reason for the failure with the tablet samples might be that no "pure analyte tablet" was available. In the EI, the net analyte signal vectors of a sample and analyte spectra are compared. But as pure analyte spectra are not always available and generally not for tablet samples the EI is not usable for this sample type.

The validation of the SE method is only performed on the zero concentration level. Therefore, it can be expected that the method will work better for low concentrations. During the work we discovered that a good selection of blank samples is the "key" to the SE indicator. For the powder samples we had measured each of the five blank samples eight times, giving forty blank spectra. Among these spectra we picked a few spectra to span the interference space and a larger portion to quantify the error. We recommend that as many blank samples as possible be measured using repeated measurements, and in that manner, instrumental noise and baseline drift are included. This is easy to do in most industrial applications, but might be more difficult for environmental products. Also reposition the samples and for powder samples, shake the samples. In that manner, heterogeneous samples are best measured.

A problematic issue for all NAS methods is that it is

CONCLUSION

We have demonstrated a new indicator for choosing the optimal preprocessing method and conducting wavelength selection of NIR spectra. The indicator was compared to existing indicators also using net analyte signal computations and the standard methodology using cross-validation results from a PLS regression model. The indicator performed better than the two reference methods using net analyte signal methodology. The invSEN failed generally to find the optimal preprocessing method and was also not useful for wavelength selection. The EI indicator was developed for wavelength selection but we tried to use it for selection of optimal preprocessing method without success for both the powder and tablet samples. For wavelength selection the EI indicator performed reasonably for the powder samples and identified a few wavelength intervals that improved the calibration model, but not the optimal selection (Fig. 5). The indicator could not be used for wavelength selection of the tablet samples. The SE indicator identified the right preprocessing method and also the optimal wavelength selection both for the powder and the tablet samples. For the tablet samples the right preprocessing method was not obvious and was identified only after subsequent wavelength selection was performed (Fig. 6). Thus, in cases where only a few samples are available, reference values are determined with a high error, or are not available, we recommend this new indicator.

In this study the proposed method is only demonstrated for reflectance spectra of powder samples and transmittance spectra of whole tablets. More and different spectroscopic applications are necessary to corroborate the obtained results and to understand the limitations of this method. It might be the case that for different applications, the proposed indicator will not always be the best choice for selection of the optimal preprocessing and wavelength points.

ACKNOWLEDGMENT

Novo Nordisk, Corporate Research Affairs (CORA) sponsored this work as a part of E.T.S. Skibsted's Ph.D. project.

1. M. Blanco, H. Iturriaga, S. Maspoch, and C. Pezuela, Analyst (Cambridge, U.K.) 123, 135 (1998).
2. M. Blanco, J. Coello, A. Eustaquio, H. Iturriaga, and S. Maspoch, Anal. Chim. Acta 392, 237 (1999).
3. M. Blanco, J. Coello, H. Iturriaga, S. Maspoch, and D. Serrano, Analyst (Cambridge, U.K.) 123, 2307 (1998).
4. R. D. Maesschalck, F. C. Sánchez, D. L. Massart, P. Doherty, and P. Hailey, Appl. Spectrosc. 52, 725 (1998).
5. F. C. Sánchez, J. Toft, B. Bogaert, S. S. Dive, and P. Hailey, Fresenius' J. Anal. Chem. 352, 771 (1995).
6. H. Martens and T. Næs, Trends Anal. Chem. 3, 204 (1984).
7. H. Martens and T. Næs, Multivariate Calibration (John Wiley and Sons, Chichester, 1989).
8. H. Swierenga, A. P. de Weijer, R. J. van Wijk, and L. M. C. Buydens, Chemom. Intell. Lab. Syst. 49, 1 (1999).
9. P. Geladi, D. McDougall, and H. Martens, Appl. Spectrosc. 39, 491 (1985).