import pandas as pd
Sometimes, we may need to use a specific encoding:
encoding = "ISO-8859-1"
encoding = "utf-8"
football = pd.read_csv("football_2.csv", encoding = "ISO-8859-1")
football.head()
ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | Club Logo | ... | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 207439 | L. Paredes | 24 | https://cdn.sofifa.org/players/4/19/207439.png | Argentina | https://cdn.sofifa.org/flags/52.png | 80 | 85 | NaN | https://cdn.sofifa.org/flags/52.png | ... | 74.0 | 73.0 | 75.0 | 72.0 | 9.0 | 14.0 | 6.0 | 9.0 | 10.0 | NaN |
1 | 156713 | A. Granqvist | 33 | https://cdn.sofifa.org/players/4/19/156713.png | Sweden | https://cdn.sofifa.org/flags/46.png | 80 | 80 | NaN | https://cdn.sofifa.org/flags/46.png | ... | 78.0 | 82.0 | 83.0 | 79.0 | 7.0 | 9.0 | 12.0 | 10.0 | 15.0 | NaN |
2 | 229909 | A. Lunev | 26 | https://cdn.sofifa.org/players/4/19/229909.png | Russia | https://cdn.sofifa.org/flags/40.png | 79 | 81 | NaN | https://cdn.sofifa.org/flags/40.png | ... | 69.0 | 18.0 | 20.0 | 12.0 | 80.0 | 73.0 | 65.0 | 77.0 | 85.0 | NaN |
3 | 187347 | I. Smolnikov | 29 | https://cdn.sofifa.org/players/4/19/187347.png | Russia | https://cdn.sofifa.org/flags/40.png | 79 | 79 | NaN | https://cdn.sofifa.org/flags/40.png | ... | 73.0 | 76.0 | 76.0 | 80.0 | 7.0 | 12.0 | 10.0 | 8.0 | 15.0 | NaN |
4 | 153260 | Hilton | 40 | https://cdn.sofifa.org/players/4/19/153260.png | Brazil | https://cdn.sofifa.org/flags/54.png | 78 | 78 | Montpellier HSC | https://cdn.sofifa.org/teams/2/light/70.png | ... | 70.0 | 83.0 | 77.0 | 76.0 | 12.0 | 7.0 | 11.0 | 12.0 | 13.0 | NaN |
5 rows × 88 columns
Variable names. Hard to read without the index.
football.columns.values.tolist()
['ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag', 'Overall', 'Potential', 'Club', 'Club Logo', 'Value', 'Wage', 'Special', 'Preferred Foot', 'International Reputation', 'Weak Foot', 'Skill Moves', 'Work Rate', 'Body Type', 'Real Face', 'Position', 'Jersey Number', 'Joined', 'Loaned From', 'Contract Valid Until', 'Height', 'Weight', 'LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW', 'LAM', 'CAM', 'RAM', 'LM', 'LCM', 'CM', 'RCM', 'RM', 'LWB', 'LDM', 'CDM', 'RDM', 'RWB', 'LB', 'LCB', 'CB', 'RCB', 'RB', 'Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower', 'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes', 'Release Clause']
football_variables_df = pd.DataFrame(football.columns.values, columns = ["Variables"])
football_variables_df
Variables | |
---|---|
0 | ID |
1 | Name |
2 | Age |
3 | Photo |
4 | Nationality |
... | ... |
83 | GKHandling |
84 | GKKicking |
85 | GKPositioning |
86 | GKReflexes |
87 | Release Clause |
88 rows × 1 columns
Disply all rows
print(football_variables_df.to_string())
Variables 0 ID 1 Name 2 Age 3 Photo 4 Nationality 5 Flag 6 Overall 7 Potential 8 Club 9 Club Logo 10 Value 11 Wage 12 Special 13 Preferred Foot 14 International Reputation 15 Weak Foot 16 Skill Moves 17 Work Rate 18 Body Type 19 Real Face 20 Position 21 Jersey Number 22 Joined 23 Loaned From 24 Contract Valid Until 25 Height 26 Weight 27 LS 28 ST 29 RS 30 LW 31 LF 32 CF 33 RF 34 RW 35 LAM 36 CAM 37 RAM 38 LM 39 LCM 40 CM 41 RCM 42 RM 43 LWB 44 LDM 45 CDM 46 RDM 47 RWB 48 LB 49 LCB 50 CB 51 RCB 52 RB 53 Crossing 54 Finishing 55 HeadingAccuracy 56 ShortPassing 57 Volleys 58 Dribbling 59 Curve 60 FKAccuracy 61 LongPassing 62 BallControl 63 Acceleration 64 SprintSpeed 65 Agility 66 Reactions 67 Balance 68 ShotPower 69 Jumping 70 Stamina 71 Strength 72 LongShots 73 Aggression 74 Interceptions 75 Positioning 76 Vision 77 Penalties 78 Composure 79 Marking 80 StandingTackle 81 SlidingTackle 82 GKDiving 83 GKHandling 84 GKKicking 85 GKPositioning 86 GKReflexes 87 Release Clause
print(football.dtypes.to_string())
ID int64 Name object Age int64 Photo object Nationality object Flag object Overall int64 Potential int64 Club object Club Logo object Value int64 Wage int64 Special int64 Preferred Foot object International Reputation float64 Weak Foot float64 Skill Moves float64 Work Rate object Body Type object Real Face object Position object Jersey Number float64 Joined object Loaned From object Contract Valid Until object Height object Weight object LS object ST object RS object LW object LF object CF object RF object RW object LAM object CAM object RAM object LM object LCM object CM object RCM object RM object LWB object LDM object CDM object RDM object RWB object LB object LCB object CB object RCB object RB object Crossing float64 Finishing float64 HeadingAccuracy float64 ShortPassing float64 Volleys float64 Dribbling float64 Curve float64 FKAccuracy float64 LongPassing float64 BallControl float64 Acceleration float64 SprintSpeed float64 Agility float64 Reactions float64 Balance float64 ShotPower float64 Jumping float64 Stamina float64 Strength float64 LongShots float64 Aggression float64 Interceptions float64 Positioning float64 Vision float64 Penalties float64 Composure float64 Marking float64 StandingTackle float64 SlidingTackle float64 GKDiving float64 GKHandling float64 GKKicking float64 GKPositioning float64 GKReflexes float64 Release Clause object
Filter for strikers only.
football_2 = football[football["Position"] == "ST"]
football_2.head()
ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | Club Logo | ... | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 187607 | A. Dzyuba | 29 | https://cdn.sofifa.org/players/4/19/187607.png | Russia | https://cdn.sofifa.org/flags/40.png | 78 | 78 | NaN | https://cdn.sofifa.org/flags/40.png | ... | 70.0 | 21.0 | 15.0 | 19.0 | 15.0 | 12.0 | 11.0 | 11.0 | 8.0 | NaN |
8 | 183389 | G. Sio | 29 | https://cdn.sofifa.org/players/4/19/183389.png | Ivory Coast | https://cdn.sofifa.org/flags/108.png | 77 | 77 | NaN | https://cdn.sofifa.org/flags/108.png | ... | 72.0 | 40.0 | 18.0 | 12.0 | 15.0 | 9.0 | 10.0 | 15.0 | 16.0 | NaN |
18 | 245683 | K. Fofana | 26 | https://cdn.sofifa.org/players/4/19/245683.png | Ivory Coast | https://cdn.sofifa.org/flags/108.png | 75 | 75 | NaN | https://cdn.sofifa.org/flags/108.png | ... | 83.0 | 23.0 | 37.0 | 46.0 | 7.0 | 11.0 | 7.0 | 11.0 | 14.0 | NaN |
45 | 190461 | B. Sigur̡arson | 27 | https://cdn.sofifa.org/players/4/19/190461.png | Iceland | https://cdn.sofifa.org/flags/24.png | 73 | 74 | NaN | https://cdn.sofifa.org/flags/24.png | ... | 76.0 | 31.0 | 39.0 | 24.0 | 9.0 | 12.0 | 10.0 | 15.0 | 16.0 | NaN |
65 | 225900 | J. Sambenito | 26 | https://cdn.sofifa.org/players/4/19/225900.png | Paraguay | https://cdn.sofifa.org/flags/58.png | 71 | 74 | NaN | https://cdn.sofifa.org/flags/58.png | ... | 74.0 | 15.0 | 16.0 | 16.0 | 15.0 | 16.0 | 15.0 | 7.0 | 7.0 | NaN |
5 rows × 88 columns
Filter for the required varables.
import numpy as np
football_3 = football_2.iloc[:, np.r_[2, 13, 18, 67, 68, 73, 75, 78, 11]]
football_3.head()
Age | Preferred Foot | Body Type | Balance | ShotPower | Aggression | Positioning | Composure | Wage | |
---|---|---|---|---|---|---|---|---|---|
5 | 29 | Right | Stocky | 32.0 | 78.0 | 75.0 | 78.0 | 70.0 | 1105 |
8 | 29 | Left | Normal | 73.0 | 77.0 | 77.0 | 76.0 | 72.0 | 2138 |
18 | 26 | Right | Normal | 60.0 | 78.0 | 67.0 | 72.0 | 83.0 | 3875 |
45 | 27 | Right | Normal | 76.0 | 68.0 | 73.0 | 73.0 | 76.0 | 3661 |
65 | 26 | Right | Lean | 64.0 | 73.0 | 49.0 | 75.0 | 74.0 | 2445 |
# Or simply (if no ranges are used)
football_2.iloc[:, [2, 13, 18, 67, 68, 73, 75, 78, 11]]
Age | Preferred Foot | Body Type | Balance | ShotPower | Aggression | Positioning | Composure | Wage | |
---|---|---|---|---|---|---|---|---|---|
5 | 29 | Right | Stocky | 32.0 | 78.0 | 75.0 | 78.0 | 70.0 | 1105 |
8 | 29 | Left | Normal | 73.0 | 77.0 | 77.0 | 76.0 | 72.0 | 2138 |
18 | 26 | Right | Normal | 60.0 | 78.0 | 67.0 | 72.0 | 83.0 | 3875 |
45 | 27 | Right | Normal | 76.0 | 68.0 | 73.0 | 73.0 | 76.0 | 3661 |
65 | 26 | Right | Lean | 64.0 | 73.0 | 49.0 | 75.0 | 74.0 | 2445 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18181 | 19 | Right | Lean | 64.0 | 67.0 | 38.0 | 61.0 | 52.0 | 3399 |
18184 | 21 | Right | Stocky | 70.0 | 64.0 | 32.0 | 56.0 | 51.0 | 9389 |
18188 | 21 | Right | Normal | 53.0 | 61.0 | 62.0 | 60.0 | 61.0 | 10780 |
18190 | 19 | Right | Normal | 68.0 | 61.0 | 51.0 | 67.0 | 62.0 | 10121 |
18203 | 16 | Right | Lean | 60.0 | 61.0 | 36.0 | 62.0 | 63.0 | 8358 |
2152 rows × 9 columns
from sklearn.model_selection import train_test_split
Define predictors and target variable.
Creating dummies applies to categorical variables only.
If a prefix is desired:
X = pd.get_dummies(X, prefix_sep = 'dummy', drop_first = True)
X = football_3.drop(columns = ["Wage"])
# Get dummies for the caterogical variables
X = pd.get_dummies(X, drop_first = True)
y = football_3["Wage"]
X
Age | Balance | ShotPower | Aggression | Positioning | Composure | Preferred Foot_Right | Body Type_Lean | Body Type_Normal | Body Type_Stocky | |
---|---|---|---|---|---|---|---|---|---|---|
5 | 29 | 32.0 | 78.0 | 75.0 | 78.0 | 70.0 | 1 | 0 | 0 | 1 |
8 | 29 | 73.0 | 77.0 | 77.0 | 76.0 | 72.0 | 0 | 0 | 1 | 0 |
18 | 26 | 60.0 | 78.0 | 67.0 | 72.0 | 83.0 | 1 | 0 | 1 | 0 |
45 | 27 | 76.0 | 68.0 | 73.0 | 73.0 | 76.0 | 1 | 0 | 1 | 0 |
65 | 26 | 64.0 | 73.0 | 49.0 | 75.0 | 74.0 | 1 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18181 | 19 | 64.0 | 67.0 | 38.0 | 61.0 | 52.0 | 1 | 1 | 0 | 0 |
18184 | 21 | 70.0 | 64.0 | 32.0 | 56.0 | 51.0 | 1 | 0 | 0 | 1 |
18188 | 21 | 53.0 | 61.0 | 62.0 | 60.0 | 61.0 | 1 | 0 | 1 | 0 |
18190 | 19 | 68.0 | 61.0 | 51.0 | 67.0 | 62.0 | 1 | 0 | 1 | 0 |
18203 | 16 | 60.0 | 61.0 | 36.0 | 62.0 | 63.0 | 1 | 1 | 0 | 0 |
2152 rows × 10 columns
y
5 1105 8 2138 18 3875 45 3661 65 2445 ... 18181 3399 18184 9389 18188 10780 18190 10121 18203 8358 Name: Wage, Length: 2152, dtype: int64
Split the dataset
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size = 0.3, random_state = 666)
Check.
train_X.head()
Age | Balance | ShotPower | Aggression | Positioning | Composure | Preferred Foot_Right | Body Type_Lean | Body Type_Normal | Body Type_Stocky | |
---|---|---|---|---|---|---|---|---|---|---|
13946 | 26 | 69.0 | 71.0 | 64.0 | 72.0 | 72.0 | 1 | 1 | 0 | 0 |
7711 | 22 | 85.0 | 69.0 | 63.0 | 47.0 | 56.0 | 1 | 0 | 0 | 1 |
8402 | 25 | 65.0 | 52.0 | 22.0 | 55.0 | 52.0 | 1 | 1 | 0 | 0 |
13651 | 26 | 59.0 | 76.0 | 80.0 | 75.0 | 72.0 | 0 | 0 | 1 | 0 |
1625 | 28 | 55.0 | 70.0 | 55.0 | 61.0 | 71.0 | 0 | 1 | 0 | 0 |
len(train_X)
1506
train_y.head()
13946 22512 7711 6760 8402 5377 13651 13711 1625 10521 Name: Wage, dtype: int64
len(train_y)
1506
valid_X.head()
Age | Balance | ShotPower | Aggression | Positioning | Composure | Preferred Foot_Right | Body Type_Lean | Body Type_Normal | Body Type_Stocky | |
---|---|---|---|---|---|---|---|---|---|---|
7882 | 30 | 59.0 | 67.0 | 39.0 | 63.0 | 52.0 | 1 | 0 | 1 | 0 |
14555 | 24 | 55.0 | 80.0 | 59.0 | 71.0 | 65.0 | 1 | 0 | 0 | 1 |
16210 | 32 | 61.0 | 70.0 | 52.0 | 75.0 | 67.0 | 1 | 0 | 0 | 1 |
15847 | 24 | 60.0 | 75.0 | 49.0 | 71.0 | 70.0 | 1 | 0 | 1 | 0 |
12382 | 27 | 67.0 | 76.0 | 55.0 | 78.0 | 77.0 | 1 | 0 | 1 | 0 |
len(valid_X)
646
valid_y.head()
7882 5628 14555 28875 16210 6941 15847 10144 12382 45877 Name: Wage, dtype: int64
len(valid_y)
646
import sklearn
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(train_X, train_y)
LinearRegression()
train_y_pred = model.predict(train_X)
train_y_pred
array([24239.85483562, 3778.75482905, -4114.30802593, ..., 18813.39087789, 21533.92321643, 22278.55484414])
train_y_pred_df = pd.DataFrame(train_y_pred, columns = ["Training_Prediction"])
train_y_pred_df
Training_Prediction | |
---|---|
0 | 24239.854836 |
1 | 3778.754829 |
2 | -4114.308026 |
3 | 29762.024484 |
4 | 15704.772118 |
... | ... |
1501 | 25387.715396 |
1502 | 21812.772659 |
1503 | 18813.390878 |
1504 | 21533.923216 |
1505 | 22278.554844 |
1506 rows × 1 columns
print("model intercept: ", model.intercept_)
print("model coefficients: ", model.coef_)
print("Model score: ", model.score(train_X, train_y))
model intercept: 293888.9838689667 model coefficients: [-9.16152410e+02 4.85334103e+01 4.63702248e+02 4.80950017e+01 6.32137998e+02 3.74975144e+02 -2.20174023e+03 -3.55489317e+05 -3.56667999e+05 -3.57613010e+05] Model score: 0.4586287207175519
Coefficients, easier to read.
print(pd.DataFrame({"Predictor": train_X.columns, "Coefficient": model.coef_}))
Predictor Coefficient 0 Age -916.152410 1 Balance 48.533410 2 ShotPower 463.702248 3 Aggression 48.095002 4 Positioning 632.137998 5 Composure 374.975144 6 Preferred Foot_Right -2201.740231 7 Body Type_Lean -355489.317404 8 Body Type_Normal -356667.999143 9 Body Type_Stocky -357613.009856
Get the RMSE for training set
mse_train = sklearn.metrics.mean_squared_error(train_y, train_y_pred)
mse_train
261404234.2377431
import math
rmse_train = math.sqrt(mse_train)
rmse_train
16168.000316605116
train_y.describe()
count 1506.000000 mean 12698.381142 std 21981.278007 min 1290.000000 25% 4692.500000 50% 6544.000000 75% 12364.250000 max 407609.000000 Name: Wage, dtype: float64
If using the dmba package:
pip install dmba
or
conda install -c conda-forge dmba
Then load the library
import dmba
from dmba import regressionSummary
import dmba
from dmba import regressionSummary
regressionSummary(train_y, train_y_pred)
Regression statistics Mean Error (ME) : 0.0000 Root Mean Squared Error (RMSE) : 16168.0003 Mean Absolute Error (MAE) : 8475.7614 Mean Percentage Error (MPE) : -32.8265 Mean Absolute Percentage Error (MAPE) : 111.6103
Residuals.
train_residuals = train_y - train_y_pred
train_residuals
13946 -1727.854836 7711 2981.245171 8402 9491.308026 13651 -16051.024484 1625 -5183.772118 ... 12759 -3215.715396 17284 -15227.772659 1016 -7666.390878 16984 -9871.923216 16744 4022.445156 Name: Wage, Length: 1506, dtype: float64
type(train_residuals)
pandas.core.series.Series
import matplotlib.pyplot as plt
plt.hist(train_residuals, bins = 30)
plt.title("Residuals for Training")
plt.show()
train_residuals_df = train_residuals.to_frame(name = "Wage_Residuals")
train_residuals_df
Wage_Residuals | |
---|---|
13946 | -1727.854836 |
7711 | 2981.245171 |
8402 | 9491.308026 |
13651 | -16051.024484 |
1625 | -5183.772118 |
... | ... |
12759 | -3215.715396 |
17284 | -15227.772659 |
1016 | -7666.390878 |
16984 | -9871.923216 |
16744 | 4022.445156 |
1506 rows × 1 columns
import matplotlib.pyplot as plt
plt.hist(train_residuals_df["Wage_Residuals"], bins = 30)
plt.title("Residuals for Training")
plt.show()
Normality
import numpy as np
from scipy.stats import shapiro
shapiro(train_y)
ShapiroResult(statistic=0.38412952423095703, pvalue=0.0)
shapiro(train_residuals)
ShapiroResult(statistic=0.5784118175506592, pvalue=0.0)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_df = pd.DataFrame()
vif_df["features"] = train_X.columns
vif_df["VIF"] = [variance_inflation_factor(train_X.values, i) for i in range(train_X.shape[1])]
print(vif_df)
features VIF 0 Age 48.741902 1 Balance 33.364691 2 ShotPower 155.270485 3 Aggression 18.482052 4 Positioning 184.558829 5 Composure 115.774481 6 Preferred Foot_Right 7.714600 7 Body Type_Lean 32.118906 8 Body Type_Normal 59.412529 9 Body Type_Stocky 11.108926
valid_y_pred = model.predict(valid_X)
valid_y_pred
array([ 2.66530045e+03, 2.39448807e+04, 1.52116766e+04, 2.42079729e+04, 2.96013138e+04, 1.24219313e+04, 7.03030428e+03, 1.95690670e+04, 1.65262949e+04, -1.47780320e+03, 1.30968036e+04, 1.46015036e+04, 1.65416124e+04, 1.23451712e+04, 2.41133395e+04, 3.44640274e+04, 1.67102936e+04, 1.29909076e+04, 3.97181605e+03, 2.20783678e+04, 5.38322832e+03, 2.09776571e+04, 2.08805388e+04, 2.34148863e+04, 1.06999140e+04, -5.56234183e+03, 1.17865372e+04, 2.23538133e+04, 1.43890326e+04, 1.75954292e+04, -1.22354489e+02, 1.29628192e+04, 8.41552263e+03, 2.27889941e+04, 2.31121184e+03, 1.94168827e+04, -6.96486103e+03, 1.37996432e+04, 3.27861195e+03, 2.86742008e+04, 2.97875564e+04, -6.32773021e+03, 2.31928580e+04, 3.04579799e+03, 1.54775276e+04, 2.42101352e+04, -5.84285114e+03, -1.29340208e+03, 4.06247257e+03, 1.80227373e+04, 2.17889372e+04, 9.70873486e+03, 8.02344053e+03, 1.92883880e+04, 1.77181908e+04, 2.18037493e+04, -4.33997863e+03, 1.56144785e+04, 9.71271318e+03, 2.05954475e+04, 4.75178565e+03, 2.30383707e+04, 1.02677166e+04, 2.35779602e+04, 9.87039872e+03, 1.68866474e+04, -1.31118203e+03, 8.87430877e+03, 3.24007684e+03, 3.23350289e+04, 1.53211409e+04, 1.07576794e+03, 1.62358353e+03, 1.73480299e+04, 1.79438503e+04, 2.49372041e+04, 2.00529623e+04, 1.68100187e+04, 8.56136513e+03, 3.46264372e+04, 2.54399862e+04, 4.68426335e+04, 9.91675364e+03, 1.83355436e+04, 3.54108387e+04, 1.70138955e+04, 1.46236932e+04, 1.02365282e+04, 1.62242848e+04, 2.37720957e+04, 3.08256560e+04, 1.59745648e+04, 1.59032676e+04, 3.19791193e+04, 2.04379622e+04, 1.55905542e+04, 2.37955466e+04, 1.04128105e+04, 9.19329209e+03, 4.21540077e+04, 1.27777969e+04, 2.41864412e+04, 2.14468738e+04, 8.06144737e+03, 2.36390151e+04, 1.44857796e+04, 2.78377031e+04, 2.65205841e+04, 2.38842489e+03, 4.01141506e+04, 1.19248133e+04, 1.74026797e+04, -3.55892891e+03, 1.72208335e+04, 1.22610883e+04, -5.54908320e+03, 9.56791282e+03, 3.64233850e+03, 3.18218976e+04, 1.47101698e+04, 3.73624918e+03, 5.33297175e+03, 2.29637658e+04, 2.10020910e+03, 2.39209825e+04, 4.18162608e+04, 3.26198242e+04, 3.84999758e+04, 2.36460768e+03, 2.00953034e+03, 2.70110133e+04, -2.01391023e+03, -2.20258250e+03, 1.69663027e+04, 3.12074612e+04, 2.05483828e+02, -3.60379347e+03, 1.35194619e+03, -2.64933746e+03, 8.64132573e+03, 1.58664227e+04, 1.67561284e+04, 1.34151485e+04, 3.70881499e+03, 4.13986671e+03, 3.08144727e+04, 1.23279327e+04, -2.22148139e+01, 5.57685656e+03, 2.56361861e+04, 2.59934302e+04, 1.34265919e+03, 2.65983638e+04, 1.18370810e+04, 1.09183425e+04, 1.43909810e+04, 3.88503740e+03, -2.33173138e+03, 2.02760295e+04, 1.87326850e+04, 1.96038255e+04, -9.93766117e+03, 7.47558360e+03, 1.66588172e+04, 1.16956311e+04, 3.90959788e+03, 7.51416417e+03, 2.06412445e+04, 1.75336594e+04, 1.33414610e+04, 1.96836251e+03, -4.82880254e+03, -3.71099746e+02, 7.66629848e+03, 6.14881071e+03, 1.23687003e+04, 1.21231323e+04, 1.35733402e+04, 1.83551007e+04, 1.70539379e+04, 3.02524002e+04, -1.07159337e+04, 6.01304553e+03, 1.14180403e+04, -3.34690338e+03, 1.00622718e+04, 2.06918571e+04, 1.73350758e+04, 2.29914945e+04, -5.50771205e+03, 2.85330575e+04, 1.98521492e+03, 1.06942632e+04, 1.73230322e+04, 2.49513789e+03, -2.79106246e+03, 4.39038348e+04, 1.43537643e+04, 1.85450045e+04, -6.16542967e+03, 1.63629421e+04, 2.79125830e+04, 1.73431638e+04, 3.14934519e+04, 4.66349422e+03, 1.16135024e+04, 3.35813201e+04, 4.53795643e+03, 3.03687846e+04, 5.61531696e+03, 7.07152851e+03, 3.18497487e+04, 1.53772460e+04, 2.67107658e+03, 2.33955389e+04, 2.40698188e+04, -4.75904102e+03, -3.45112153e+03, 2.41953892e+04, 1.23263230e+04, 2.13734730e+04, 4.57129895e+03, 4.79790380e+03, 3.16235504e+04, 2.16213142e+04, 1.75585724e+04, 5.23850275e+03, 1.36123705e+04, 1.30277524e+04, 1.95346600e+04, 2.84475997e+03, 2.44038322e+04, 7.41182434e+03, 1.37544881e+04, -7.02660112e+02, 1.64984689e+04, 1.11341685e+04, 7.89961221e+03, 2.46923901e+04, 1.76240587e+04, 8.96931690e+03, 6.48185568e+03, -1.26222354e+03, -1.69218974e+03, 2.57214042e+03, 1.61502589e+04, 1.28138050e+04, 9.60814203e+03, 1.63259753e+04, 6.50911760e+03, -9.95762340e+03, 3.42057372e+02, 1.02190745e+04, 1.67276920e+04, 2.21600760e+04, 4.65490924e+03, 1.25402256e+04, 1.61009022e+04, 1.90925692e+04, 1.98924863e+04, 1.10061114e+04, 5.83648195e+03, 1.70692595e+04, 1.45910951e+04, 2.66219412e+04, 8.48576063e+03, 1.69487687e+04, 2.99414092e+03, 1.80086237e+04, 1.87362018e+04, 3.54003722e+04, -3.68164534e+02, 2.37253161e+04, 2.03710982e+04, 1.20499329e+04, 2.47400389e+04, -7.52342461e+03, 2.49023341e+04, 6.50286770e+03, 4.85828603e+03, 9.87788899e+03, 1.49300957e+04, 9.99666283e+03, 2.42095834e+04, 2.23392492e+03, 3.87104769e+03, 1.84381152e+04, 1.99189419e+04, 2.88502679e+04, 2.95576858e+03, 1.02739668e+04, 8.61212517e+03, 3.79011769e+04, 1.18256155e+04, 6.26209526e+03, 2.04705359e+04, 8.36627864e+03, 1.49795212e+03, 1.96112157e+04, 1.25041470e+04, 3.33043822e+03, 8.02272601e+03, 6.30190884e+01, 1.38667489e+04, -2.66181757e+03, 2.41768675e+02, 1.19301758e+04, 1.17904093e+04, 2.05007501e+04, -1.68721187e+04, 2.41243743e+04, 3.72096058e+03, 6.75666851e+03, 1.58471520e+04, 1.94148247e+04, 1.48026594e+03, 1.31713425e+04, 2.98162207e+04, -4.06148613e+03, 2.23375583e+04, 1.19660384e+04, 2.43317954e+04, 5.31000984e+04, 1.79335761e+04, 1.34151909e+04, -4.84794752e+03, 1.87710521e+04, 1.87750241e+04, 2.12204070e+04, 2.20818928e+04, 1.15592485e+03, -1.04175449e+04, -3.17875690e+03, 2.90112257e+04, 1.69765732e+04, 2.60161382e+04, 2.43898074e+04, 8.39128773e+02, 1.63886547e+04, -7.37612923e+03, 5.62485527e+03, 6.74975491e+03, 1.22944936e+04, 1.33810595e+04, 5.46762848e+03, 1.28159578e+04, 1.08639469e+04, 1.03348099e+03, 1.63132754e+03, 1.44072810e+04, 1.84246280e+04, -2.47909605e+03, 1.95474530e+04, 1.99982066e+04, 2.25021690e+04, 1.32872506e+04, 3.28203363e+03, 9.06686227e+03, 2.45238371e+04, 1.33661644e+04, -1.14433324e+03, 1.01026574e+04, 2.07442960e+04, 1.01544218e+04, 1.47549186e+04, 1.99151083e+04, 1.84707327e+04, 2.91420465e+03, 1.55694684e+04, 1.98216015e+04, 7.22538464e+03, 1.59688520e+04, 8.82264506e+03, -5.57106659e+03, -3.61325599e+03, 1.66954298e+04, 6.04080018e+03, 5.64313195e+03, 2.34846318e+04, 4.95252927e+03, 1.17609615e+04, 2.73460395e+02, 4.62524003e+04, 1.94219123e+04, 2.67636958e+03, 7.99161270e+02, 3.45401861e+04, 2.59576995e+03, 1.89228641e+04, 3.42841898e+04, 4.15955289e+03, 3.50195062e+04, 2.12903857e+04, -5.06633274e+03, 1.72083042e+04, 1.09837661e+04, 9.35620710e+03, 1.39696572e+04, 2.02365913e+04, 2.01273692e+04, 4.92787054e+03, 2.75630232e+04, 1.94245386e+04, 1.78279991e+04, 2.49990535e+04, 2.22097084e+04, 8.43216129e+03, 8.92326875e+02, 5.49338576e+03, 1.95493159e+04, 1.06933156e+04, 1.17082833e+04, 2.25027856e+04, 1.45716041e+04, 4.09754472e+03, 9.19128351e+03, 7.77620947e+03, 1.90177070e+04, 1.96619616e+04, 4.66622488e+04, 3.67881827e+03, 5.08111933e+03, 5.38303692e+03, 1.24476702e+04, 1.84716579e+04, -1.31531118e+03, 2.03243896e+04, 1.86851668e+04, 3.77256684e+03, 1.01734030e+04, 3.06560978e+04, 2.34528445e+04, 1.83387120e+04, 1.40144687e+04, 1.65206404e+04, 3.96428102e+04, 1.04960032e+04, -4.15137281e+03, 1.05653584e+04, 3.39106609e+04, 2.97230029e+04, 1.53912380e+04, 1.11446062e+04, 2.24223425e+04, 2.68597465e+04, 1.22405459e+04, -2.66504745e+03, 1.20825230e+04, 2.03241862e+04, -3.92528017e+03, 5.56118121e+03, -5.96828907e+03, 1.60675081e+04, 8.47372390e+03, 1.66574753e+04, -1.08819482e+04, 2.20162527e+04, 1.36954201e+04, 1.52982927e+04, 2.86445902e+04, 2.15596058e+04, 3.72854768e+03, 6.01809567e+03, 2.62442863e+04, 1.78847640e+04, 2.60260009e+04, 2.21141255e+04, -2.91738379e+03, 2.81612491e+04, 4.74778654e+04, 8.43042557e+03, 2.17905824e+04, 1.78861898e+04, 2.81132855e+04, 2.24379049e+04, 4.87238954e+03, 3.48894848e+04, 1.06339999e+04, 4.88834577e+03, 2.37027527e+04, 2.77041377e+04, 1.67256495e+04, -1.23767615e+03, 2.45406129e+04, -6.73009778e+03, -3.71221165e+03, 1.87645554e+04, 3.45908317e+03, 9.69112280e+03, 9.79096864e+03, -6.92458269e+03, 3.78754518e+03, -3.03440938e+03, 2.51852102e+04, -4.47402223e+03, 3.98987083e+04, 2.19649914e+04, 1.91549960e+04, 8.60118620e+03, 2.05626620e+04, -3.69035608e+03, 1.48537949e+04, 4.04843946e+03, 3.16774267e+04, 1.70576259e+04, 1.73557586e+03, 3.20480351e+04, 1.56235213e+04, 5.63278927e+03, 2.32403127e+04, 2.46781892e+04, 7.00114435e+03, -3.58186265e+03, 2.21807504e+04, -6.37439516e+03, 7.92053004e+03, 5.22713559e+03, 3.85323630e+03, 1.61069767e+04, 1.75678237e+04, 1.81873006e+04, 1.00464058e+04, 8.84400565e+03, 1.72847394e+04, 5.17322566e+02, 2.56655118e+04, 2.62752003e+04, 1.07591937e+04, 1.51229379e+04, 1.93102802e+04, 9.00728050e+03, 2.97264751e+03, -1.65027255e+03, 1.94821918e+04, -4.41337726e+03, 1.89499469e+04, 3.01911889e+01, 1.34545696e+04, 5.57215010e+03, 4.50681337e+02, 3.92060784e+04, 1.32187052e+04, 1.10678007e+04, 1.06085720e+04, -4.72793002e+03, 2.23111111e+04, 4.89426890e+03, 1.58645569e+04, -4.78889524e+03, 2.36842533e+03, 3.30683841e+04, 4.89473966e+02, 1.38929836e+04, 2.39812906e+04, -8.08396068e+03, 1.55943106e+04, 2.26993605e+04, 3.51540327e+04, 3.59982676e+04, 2.57322782e+04, 2.85020030e+04, 1.23762047e+04, 1.77077060e+04, 6.05218625e+02, 2.12468521e+04, 1.98247308e+03, 9.75425736e+03, 1.91635760e+04, 1.11030216e+04, 2.59148087e+04, 1.04661967e+04, 3.23588749e+04, 3.13942429e+03, -4.08853810e+03, 1.52620187e+04, -1.58765960e+04, 1.92577674e+04, 2.37955347e+04, 1.31788779e+04, 1.08961748e+04, 1.81281065e+04, 1.83373496e+04, 9.36618933e+03, 2.37138764e+04, 1.44469600e+04, 2.24502896e+02, 1.87964914e+04, 1.71068838e+04, -1.06577785e+03, 1.00160136e+04, 1.60681170e+04, 3.22133872e+04, 2.30064316e+04, 6.75391713e+03, 1.60378128e+04, 1.32996181e+04, -1.49122318e+02, 8.22587441e+03, 3.26499372e+04, 1.85186487e+04, -4.30577382e+03, 1.16753828e+04, 5.05871265e+03, 4.04057843e+04, 7.68904389e+03, 1.82256266e+04, -2.70297307e+03, 2.81515606e+04, -5.27176698e+03, 1.86762020e+04, 1.06676760e+04, 3.22475566e+04, -5.52105925e+03, 2.62373303e+04, 1.44986397e+04, -4.73905063e+03, 3.65046624e+04, 4.02480572e+03, 8.28482915e+03, 1.95814608e+04, 8.41202780e+03, -3.84857827e+03, 8.40586509e+03, 1.03176404e+04, 2.43958443e+04, 1.38968987e+04, 3.67708230e+04, 1.72569153e+04, 6.03696891e+03, -1.43788106e+03, 3.32488578e+04, 1.62319091e+04, 1.61211529e+04, 4.90873986e+03, 9.34189590e+03, -5.86994130e+03, 1.42400588e+04, 3.39109284e+02, 3.07350763e+03, 1.75517998e+04, 1.60393277e+04, 5.81477513e+02, -1.61141500e+03, 2.06980825e+04, 1.15261767e+04, 3.04436687e+04])
valid_y_pred_df = pd.DataFrame(valid_y_pred, columns = ["Validation_Prediction"])
valid_y_pred_df
Validation_Prediction | |
---|---|
0 | 2665.300452 |
1 | 23944.880671 |
2 | 15211.676647 |
3 | 24207.972902 |
4 | 29601.313799 |
... | ... |
641 | 581.477513 |
642 | -1611.414995 |
643 | 20698.082536 |
644 | 11526.176711 |
645 | 30443.668736 |
646 rows × 1 columns
Get the RMSE for validation set.
mse_valid = sklearn.metrics.mean_squared_error(valid_y, valid_y_pred)
mse_valid
380956622.57514906
# As before
# import math
rmse_valid = math.sqrt(mse_valid)
rmse_valid
19518.110117917386
valid_y.describe()
count 646.000000 mean 13535.160991 std 23624.770667 min 1105.000000 25% 4708.750000 50% 6750.500000 75% 12827.750000 max 301070.000000 Name: Wage, dtype: float64
# As before:
# If using the dmba package:
# pip install dmba
# Done earlier. Just for illustration
# import dmba
# from dmba import regressionSummary
regressionSummary(valid_y, valid_y_pred)
Regression statistics Mean Error (ME) : 91.4105 Root Mean Squared Error (RMSE) : 19518.1101 Mean Absolute Error (MAE) : 9319.7708 Mean Percentage Error (MPE) : -43.3987 Mean Absolute Percentage Error (MAPE) : 118.8390
Residuals.
valid_residuals = valid_y - valid_y_pred
valid_residuals.head()
7882 2962.699548 14555 4930.119329 16210 -8270.676647 15847 -14063.972902 12382 16275.686201 Name: Wage, dtype: float64
import matplotlib.pyplot as plt
plt.hist(valid_residuals, bins = 30)
plt.title("Residuals for Validation")
plt.show()
valid_residuals_df = valid_residuals.to_frame(name = "Wage_Residuals")
valid_residuals_df
Wage_Residuals | |
---|---|
7882 | 2962.699548 |
14555 | 4930.119329 |
16210 | -8270.676647 |
15847 | -14063.972902 |
12382 | 16275.686201 |
... | ... |
8620 | 5455.522487 |
10786 | 6010.414995 |
16154 | 13639.917464 |
4990 | -8092.176711 |
14654 | -14046.668736 |
646 rows × 1 columns
import matplotlib.pyplot as plt
plt.hist(valid_residuals_df["Wage_Residuals"], bins = 30)
plt.title("Residuals for Validation")
plt.show()
Scikit-learn does not provide traditional regression model summaries.
Use statsmodels package if desired.
conda install -c conda-forge statsmodels
or
pip install statsmodels
import statsmodels.api as sm
model_statsmodels = sm.OLS(train_y, train_X)
results = model_statsmodels.fit()
print(results.summary())
OLS Regression Results ======================================================================================= Dep. Variable: Wage R-squared (uncentered): 0.515 Model: OLS Adj. R-squared (uncentered): 0.512 Method: Least Squares F-statistic: 159.1 Date: Sat, 18 Feb 2023 Prob (F-statistic): 4.36e-227 Time: 13:26:40 Log-Likelihood: -16865. No. Observations: 1506 AIC: 3.375e+04 Df Residuals: 1496 BIC: 3.380e+04 Df Model: 10 Covariance Type: nonrobust ======================================================================================== coef std err t P>|t| [0.025 0.975] ---------------------------------------------------------------------------------------- Age -852.0938 127.623 -6.677 0.000 -1102.432 -601.756 Balance 151.9979 40.583 3.745 0.000 72.392 231.604 ShotPower 633.7931 85.723 7.393 0.000 465.642 801.944 Aggression 15.2246 36.329 0.419 0.675 -56.037 86.486 Positioning 716.7647 94.926 7.551 0.000 530.562 902.968 Composure 382.7540 81.366 4.704 0.000 223.151 542.357 Preferred Foot_Right -493.9671 1360.831 -0.363 0.717 -3163.306 2175.372 Body Type_Lean -8.622e+04 4538.366 -18.999 0.000 -9.51e+04 -7.73e+04 Body Type_Normal -8.807e+04 4632.299 -19.011 0.000 -9.72e+04 -7.9e+04 Body Type_Stocky -8.937e+04 4906.471 -18.216 0.000 -9.9e+04 -7.97e+04 ============================================================================== Omnibus: 2072.969 Durbin-Watson: 1.959 Prob(Omnibus): 0.000 Jarque-Bera (JB): 660076.649 Skew: 7.548 Prob(JB): 0.00 Kurtosis: 104.446 Cond. No. 2.46e+03 ============================================================================== Notes: [1] R² is computed without centering (uncentered) since the model does not contain a constant. [2] Standard Errors assume that the covariance matrix of the errors is correctly specified. [3] The condition number is large, 2.46e+03. This might indicate that there are strong multicollinearity or other numerical problems.
new_players_df = pd.read_csv("new_players.csv")
new_players_df
Age | Balance | ShotPower | Aggression | Positioning | Composure | Preferred Foot_Right | Body Type_Lean | Body Type_Normal | Body Type_Stocky | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 27 | 59 | 75 | 68 | 80 | 76 | 1 | 0 | 0 | 1 |
1 | 21 | 42 | 71 | 52 | 60 | 76 | 1 | 1 | 0 | 0 |
2 | 19 | 76 | 80 | 22 | 75 | 56 | 0 | 0 | 0 | 1 |
new_records_players_pred = model.predict(new_players_df)
new_records_players_pred
array([29318.86943019, 20847.31938856, 29120.84519698])
# As before
# import pandas as pd
new_records_players_pred_df = pd.DataFrame(new_records_players_pred, columns = ["Prediction"])
new_records_players_pred_df
# to export
# new_records_players_pred_df.to_csv("whatever_name.csv")
Prediction | |
---|---|
0 | 29318.869430 |
1 | 20847.319389 |
2 | 29120.845197 |
alpha = 0.05
ci = np.quantile(train_residuals, 1 - alpha)
ci
17225.89938935508
def generate_results_confint(preds, ci):
df = pd.DataFrame()
df["Prediction"] = preds
if ci >= 0:
df["upper"] = preds + ci
df["lower"] = preds - ci
else:
df["upper"] = preds - ci
df["lower"] = preds + ci
return df
new_records_players_pred_confint_df = generate_results_confint(new_records_players_pred, ci)
new_records_players_pred_confint_df
Prediction | upper | lower | |
---|---|---|---|
0 | 29318.869430 | 46544.768820 | 12092.970041 |
1 | 20847.319389 | 38073.218778 | 3621.419999 |
2 | 29120.845197 | 46346.744586 | 11894.945808 |