Dersler

View on GitHub

EEM-272 Makine Öğrenmesine Giriş

Train Test Split

import pandas as pd
import matplotlib.pyplot as plt

Veri dosyasını indirmek için tıklayınız.

df = pd.read_csv("04_carprices.csv")
df.head()
Mileage Age(yrs) Sell Price($)
0 69000 6 18000
1 35000 3 34000
2 57000 5 26100
3 22500 2 40000
4 46000 4 31500
data=df.to_numpy()
data[:3]
array([[69000,     6, 18000],
       [35000,     3, 34000],
       [57000,     5, 26100]], dtype=int64)
milage=data[:,0]
age=data[:,1]
price=data[:,2]
plt.scatter(milage,price)
plt.show()

png

plt.scatter(age,price)
plt.show()

png

X = data[:,0:2]
y = data[:,2]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3) 
X_train.shape, y_train.shape
((14, 2), (14,))
X_test.shape, y_test.shape
((6, 2), (6,))

Linear regression

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)
X_test
array([[82450,     7],
       [25400,     3],
       [69000,     5],
       [87600,     8],
       [46000,     4],
       [91000,     8]], dtype=int64)
reg.predict(X_test)
array([17224.19150145, 38357.0955021 , 21908.68515977, 15496.47763339,
       30601.10592035, 14169.86679672])
y_test
array([19400, 35000, 19700, 12800, 31500, 12000], dtype=int64)
# R2 score
reg.score(X_test, y_test)
0.9260837288108493

$R^2$ metriği 1’e ne kadar yakınsa model o kadar iyi sonuç vermiş demektir.

\[R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\]

$y_i$: gerçek değerler
$\hat{y}_i$: tahminler
$\bar{y}$: gerçek değerlerin ortalaması

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
\[\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|\] \[\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\] \[\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|\]
# Tahminler
y_train_pred = reg.predict(X_train)
y_test_pred = reg.predict(X_test)

# --- TRAIN PERFORMANSI ---
print("TRAIN PERFORMANSI")
print("R2:", r2_score(y_train, y_train_pred))
print("MAE:", mean_absolute_error(y_train, y_train_pred))
print("MSE:", mean_squared_error(y_train, y_train_pred))
print("MAPE:", mean_absolute_percentage_error(y_train, y_train_pred))


print("\n----------------------\n")

# --- TEST PERFORMANSI ---
print("TEST PERFORMANSI")
print("R2:", r2_score(y_test, y_test_pred))
print("MAE:", mean_absolute_error(y_test, y_test_pred))
print("MSE:", mean_squared_error(y_test, y_test_pred))
print("MAPE:", mean_absolute_percentage_error(y_test, y_test_pred))
TRAIN PERFORMANSI
R2: 0.9358120111765775
MAE: 1436.3758569034203
MSE: 3060470.2075112266
MAPE: 0.0669296461160506

----------------------

TEST PERFORMANSI
R2: 0.9341950255861509
MAE: 2071.1044298593897
MSE: 5462347.541766587
MAPE: 0.09107989849074923

random_state argümanı ile her defasında aynı bölme işlemi sağlanabilir.

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=10)
X_test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=10)
X_test
array([[72000,     6],
       [83000,     7],
       [59000,     5],
       [52000,     5],
       [22500,     2],
       [87600,     8]], dtype=int64)
reg = LinearRegression()
reg.fit(X_train, y_train)

# Tahminler
y_train_pred = reg.predict(X_train)
y_test_pred = reg.predict(X_test)
# --- TRAIN PERFORMANSI ---
print("TRAIN PERFORMANSI")
print("R2:", r2_score(y_train, y_train_pred))
print("MAE:", mean_absolute_error(y_train, y_train_pred))
print("MSE:", mean_squared_error(y_train, y_train_pred))
print("MAPE:", mean_absolute_percentage_error(y_train, y_train_pred))

print("\n----------------------\n")

# --- TEST PERFORMANSI ---
print("TEST PERFORMANSI")
print("R2:", r2_score(y_test, y_test_pred))
print("MAE:", mean_absolute_error(y_test, y_test_pred))
print("MSE:", mean_squared_error(y_test, y_test_pred))
print("MAPE:", mean_absolute_percentage_error(y_test, y_test_pred))
TRAIN PERFORMANSI
R2: 0.9384897023666824
MAE: 1443.9521363065562
MSE: 2932798.438034618
MAPE: 0.06829052176926063

----------------------

TEST PERFORMANSI
R2: 0.9212422483776329
MAE: 2285.4999884929643
MSE: 6537533.291388397
MAPE: 0.0978060876864723

Ölçeklendirme

from sklearn.preprocessing import StandardScaler, MinMaxScaler

StandardScaler

\[x' = \frac{x - \mu}{\sigma}\]

μ: verinin ortalaması
σ: verinin standart sapması
Anlamı: Bu değer ortalamadan kaç standart sapma uzaklıkta?

MinMaxScaler

\[x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\]

Veri 0-1 arasına yerlerşir.

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)
# --------- ÖLÇEKLENDİRME (SADECE TRAIN'E FIT) ---------
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)   # fit + transform
X_test_scaled = scaler.transform(X_test)         # SADECE transform
# --------- MODEL ---------
reg = LinearRegression()
reg.fit(X_train_scaled, y_train)
# --------- TAHMİNLER ---------
y_train_pred = reg.predict(X_train_scaled)
y_test_pred = reg.predict(X_test_scaled)
# --------- TRAIN PERFORMANSI ---------
print("TRAIN PERFORMANSI")
print("R2:", r2_score(y_train, y_train_pred))
print("MAE:", mean_absolute_error(y_train, y_train_pred))
print("MSE:", mean_squared_error(y_train, y_train_pred))
print("MAPE:", mean_absolute_percentage_error(y_train, y_train_pred))

print("\n----------------------\n")

# --------- TEST PERFORMANSI ---------
print("TEST PERFORMANSI")
print("R2:", r2_score(y_test, y_test_pred))
print("MAE:", mean_absolute_error(y_test, y_test_pred))
print("MSE:", mean_squared_error(y_test, y_test_pred))
print("MAPE:", mean_absolute_percentage_error(y_test, y_test_pred))
TRAIN PERFORMANSI
R2: 0.9384897023666824
MAE: 1443.9521363065564
MSE: 2932798.4380346197
MAPE: 0.06829052176926066

----------------------

TEST PERFORMANSI
R2: 0.921242248377633
MAE: 2285.499988492963
MSE: 6537533.291388389
MAPE: 0.09780608768647225

Kaynak: