Introduction: This article introduces linear regression, which models the relationship between one or more continuous independent variables and a continuous dependent variable. Linear regression is divided into simple linear regression and multiple linear regression.
1 Correlation analysis: the relationship between a continuous variable and a continuous variable.
2 Two-sample t-test: the relationship between a binary categorical variable and a continuous variable.
3 Analysis of variance: the relationship between a multi-class categorical variable and a continuous variable.
4 Chi-square test: the relationship between a binary or multi-class categorical variable and a binary categorical variable.
Linear regression: the relationship between one or more continuous variables and one continuous variable.
Linear regression is divided into simple linear regression and multiple linear regression.
/ 01 / Data Analysis and Data Mining
Database: A tool for storing data. Because Python does its calculations in memory, it struggles to process tens of gigabytes of data, so data cleaning sometimes needs to be done in the database.
Statistics: Data analysis methods for small data, such as data sampling, descriptive analysis, and result testing.
Artificial Intelligence/Machine Learning/Pattern Recognition: Neural network algorithms, which mimic the operation of the human nervous system, can learn from training data and then predict unknown data based on what they have learned.
/ 02 / Regression Equation
01 Simple Linear Regression
Simple linear regression has only one independent variable and one dependent variable. Its model, Y = β0 + β1X + ε, contains a "regression coefficient" (β1), an "intercept" (β0), and a "disturbance term" (ε).
The "disturbance term" is also called the "random error" and follows a normal distribution with a mean of zero.
The difference between the actual value of the dependent variable and the predicted value of the linear regression is called the "residual".
Linear regression is designed to minimize the sum of squared residuals.
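To make the idea of minimizing the sum of squared residuals concrete, here is a minimal sketch using only NumPy. The data is synthetic (generated from an assumed line y = 2 + 3x plus noise), and the helper names `fit_simple_ols` and `ssr` are hypothetical, introduced only for this illustration. It uses the standard closed-form least-squares solution for one predictor.

```python
import numpy as np

# Synthetic data, assumed for illustration only: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=200)


def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope


def ssr(x, y, intercept, slope):
    """Sum of squared residuals, where residual = actual - predicted."""
    residuals = y - (intercept + slope * x)
    return np.sum(residuals ** 2)


intercept, slope = fit_simple_ols(x, y)
best = ssr(x, y, intercept, slope)
# Moving the coefficients away from the fitted values increases the SSR
worse = ssr(x, y, intercept + 0.1, slope - 0.05)
print(intercept, slope, best, worse)
```

The fitted coefficients should land close to the assumed 2 and 3, and any other pair of coefficients gives a larger sum of squared residuals.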
A simple linear regression is implemented below using the case from the book.
Establish a predictive model of income and monthly average credit card spending.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
# Stop pandas from truncating columns and wrapping lines in its output
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# Read the data; skipinitialspace ignores blanks after the separator
df = pd.read_csv('creditcard_exp.csv', skipinitialspace=True)
The data read in is as follows.
Next, run a correlation analysis on the data.
# Get the rows where credit card spending is recorded
exp = df[df['avg_exp'].notnull()].copy().iloc[:, 2:].drop('age2', axis=1)
# Get the rows where credit card spending is missing (NaN)
exp_new = df[df['avg_exp'].isnull()].copy().iloc[:, 2:].drop('age2', axis=1)
# Descriptive statistical analysis
# Correlation analysis
print(exp[['avg_exp', 'Age', 'Income', 'dist_home_val']].corr(method='pearson'))
The results show that income (Income) and average expenditure (avg_exp) are highly correlated, with a Pearson correlation coefficient of 0.674.
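As a rough illustration of what `corr(method='pearson')` computes, the sketch below calculates the Pearson coefficient by hand and compares it with pandas. The numbers in `toy` are made up for this example and are not taken from creditcard_exp.csv; the helper `pearson_r` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only; not the creditcard_exp.csv data
toy = pd.DataFrame({
    'Income': [3.0, 4.5, 5.0, 6.5, 8.0, 9.5],
    'avg_exp': [520.0, 700.0, 690.0, 900.0, 1010.0, 1250.0],
})


def pearson_r(a, b):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))


r_manual = pearson_r(toy['Income'], toy['avg_exp'])
r_pandas = toy['Income'].corr(toy['avg_exp'], method='pearson')
print(r_manual, r_pandas)
```

Both values should agree up to floating-point error, and the coefficient always lies between -1 and 1.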
Modeling was performed using simple linear regression.
# Create a model using simple linear regression
lm_s = ols('avg_exp ~ Income', data=exp).fit()
# Output basic model information, regression coefficients and test statistics, and other model diagnostics
print(lm_s.summary())
The regression coefficients of the simple linear regression are output as follows.
From the above, the regression coefficient is 97.73 and the intercept is 258.05.
The model overview is as follows.
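The R-squared value is 0.454 (the square of the 0.674 correlation between Income and avg_exp, as expected for a regression with a single predictor) and the p-value is close to 0, so the model still has some reference value.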
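The link between R-squared and the correlation coefficient can be checked with a small sketch. The data here is synthetic, with coefficients and noise chosen arbitrarily for illustration, and the fit uses `np.polyfit` rather than statsmodels to keep the example short.

```python
import numpy as np

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(42)
x = rng.normal(5, 2, size=100)
y = 250 + 100 * x + rng.normal(0, 200, size=100)

# Least-squares line; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)

residuals = y - (intercept + slope * x)
ss_res = np.sum(residuals ** 2)        # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]            # Pearson correlation of x and y
print(r_squared, r ** 2)
```

With one predictor and an intercept, the two printed values agree up to floating-point error.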
The fitted linear regression model is then applied to the training data set to obtain its predicted values and residuals.
# The generated model uses predict to produce the predicted value, and resid is the residual of the training data set
print(pd.DataFrame([lm_s.predict(exp), lm_s.resid], index =['predict', 'resid']).T.head())
The output can be compared with the output when the data is first read.
Finally, use the model to predict on the data set whose spending is unknown.
# Use the model to predict on the data set to be predicted
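The prediction step might look like the sketch below. It does not use the article's data: `toy`, `train`, `to_predict` and the column `avg_exp_pred` are hypothetical names, and the values are synthetic. It assumes statsmodels is installed and follows the same pattern as above: fit on the rows where the target is known, then call `predict` on the rows where it is missing.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the credit card data (assumed values)
rng = np.random.default_rng(1)
income = rng.uniform(3, 15, size=60)
toy = pd.DataFrame({
    'Income': income,
    'avg_exp': 250 + 100 * income + rng.normal(0, 50, size=60),
})
# Pretend the last 10 customers have no recorded spending
toy.loc[toy.index[-10:], 'avg_exp'] = np.nan

# Split into rows with a known target and rows to be predicted
train = toy[toy['avg_exp'].notnull()].copy()
to_predict = toy[toy['avg_exp'].isnull()].copy()

# Fit on the known rows, then predict from Income alone
model = ols('avg_exp ~ Income', data=train).fit()
to_predict['avg_exp_pred'] = model.predict(to_predict)
print(to_predict[['Income', 'avg_exp_pred']].head())
```

Only the right-hand side of the formula (Income) is needed for prediction, so the missing avg_exp values in `to_predict` do not get in the way.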