MultipleRegression
March 14, 2024
1 Regressão múltipla
• E se mais de uma variável influenciar o que está sendo interessado?
• Exemplo: predizer o preço de um carro com base em seus vários atributos.
• Se também houver multiplas variáveis dependentes - coisas que estão tentando ser previstas
- isso é uma regressão multivariável.
1.0.1 Ainda usa least squares
• A unica diferença é que agora terá coeficientes diferentes para cada fator.
• Esses coeficientes implicam no quão importante cada fator realmente é, se os dados estiverem
normalizados.
• Se é livrado de variáveis que não influenciam.
• Ainda pode medir a adequação com r-squared.
• Precisa assumir que os diferentes fatores não são dependentes uns dos outros.
1.1 Pratica
[2]: import pandas as pd
df = pd.read_excel('cars.xls')
[3]: %matplotlib inline
import numpy as np
df1=df[['Mileage','Price']]
bins = np.arange(0,50000,10000)
groups = df1.groupby(pd.cut(df1['Mileage'],bins)).mean()
print(groups.head())
groups['Price'].plot.line()
Mileage Price
Mileage
(0, 10000] 5588.629630 24096.714451
(10000, 20000] 15898.496183 21955.979607
(20000, 30000] 24114.407104 20278.606252
(30000, 40000] 33610.338710 19463.670267
/tmp/ipykernel_12254/679127490.py:5: FutureWarning: The default of
observed=False is deprecated and will be changed to True in a future version of
1
pandas. Pass observed=False to retain current behavior or observed=True to adopt
the future default and silence this warning.
groups = df1.groupby(pd.cut(df1['Mileage'],bins)).mean()
[3]: <Axes: xlabel='Mileage'>
[4]: import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
X = df[['Mileage', 'Cylinder', 'Doors']]
y = df['Price']
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage',␣
↪'Cylinder', 'Doors']].values)
X = sm.add_constant(X)
print (X)
est = sm.OLS(y, X).fit()
2
print(est.summary())
const Mileage Cylinder Doors
0 1.0 -1.417485 0.52741 0.556279
1 1.0 -1.305902 0.52741 0.556279
2 1.0 -0.810128 0.52741 0.556279
3 1.0 -0.426058 0.52741 0.556279
4 1.0 0.000008 0.52741 0.556279
.. … … … …
799 1.0 -0.439853 0.52741 0.556279
800 1.0 -0.089966 0.52741 0.556279
801 1.0 0.079605 0.52741 0.556279
802 1.0 0.750446 0.52741 0.556279
803 1.0 1.932565 0.52741 0.556279
[804 rows x 4 columns]
OLS Regression Results
==============================================================================
Dep. Variable: Price R-squared: 0.360
Model: OLS Adj. R-squared: 0.358
Method: Least Squares F-statistic: 150.0
Date: Thu, 14 Mar 2024 Prob (F-statistic): 3.95e-77
Time: 16:21:34 Log-Likelihood: -8356.7
No. Observations: 804 AIC: 1.672e+04
Df Residuals: 800 BIC: 1.674e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.134e+04 279.405 76.388 0.000 2.08e+04 2.19e+04
Mileage -1272.3412 279.567 -4.551 0.000 -1821.112 -723.571
Cylinder 5587.4472 279.527 19.989 0.000 5038.754 6136.140
Doors -1404.5513 279.446 -5.026 0.000 -1953.085 -856.018
==============================================================================
Omnibus: 157.913 Durbin-Watson: 0.069
Prob(Omnibus): 0.000 Jarque-Bera (JB): 257.529
Skew: 1.278 Prob(JB): 1.20e-56
Kurtosis: 4.074 Cond. No. 1.03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
/tmp/ipykernel_12254/1575598944.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
3
See the caveats in the documentation: https://pandas.pydata.org/pandas-
docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage',
'Cylinder', 'Doors']].values)
[5]: y.groupby(df.Doors).mean()
[5]: Doors
2 23807.135520
4 20580.670749
Name: Price, dtype: float64
[11]: scaled = scale.transform([[45000, 8, 4]])
scaled = np.insert(scaled[0], 0, 1)
print(scaled)
predicted = est.predict(scaled)
print(predicted)
[1. 3.07256589 1.96971667 0.55627894]
[27658.15707316]