Cutting edge technology

Introduction: This article introduces linear regression, which models the relationship between one or more continuous independent variables and a continuous dependent variable. Linear regression is divided into simple linear regression and multiple linear regression.

Variable analysis:

1 Correlation analysis: the relationship between one continuous variable and another continuous variable.

2 Two-sample t-test: the relationship between a binary categorical variable and a continuous variable.

3 Analysis of variance: the relationship between a multi-class categorical variable and a continuous variable.

4 Chi-square test: the relationship between a binary or multi-class categorical variable and a binary categorical variable.
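The four tests listed above can be sketched with `scipy.stats`. This is only an illustration on synthetic data: the variable names (`income`, `spend`, `gender`, `edu`) and the generated values are assumptions made for this example and are not taken from the credit card data set used later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data, hypothetical and for illustration only
income = rng.normal(10, 2, size=200)                  # continuous
spend = 100 * income + rng.normal(0, 150, size=200)   # continuous
gender = rng.integers(0, 2, size=200)                 # binary categorical
edu = rng.integers(0, 4, size=200)                    # multi-class categorical

# 1 Correlation analysis: continuous vs. continuous
r, p_corr = stats.pearsonr(income, spend)

# 2 Two-sample t-test: binary categorical vs. continuous
t_stat, p_t = stats.ttest_ind(spend[gender == 0], spend[gender == 1])

# 3 Analysis of variance: multi-class categorical vs. continuous
f_stat, p_f = stats.f_oneway(*[spend[edu == k] for k in range(4)])

# 4 Chi-square test: multi-class categorical vs. binary categorical,
#   computed on a 4 x 2 contingency table of counts
table = np.array([[np.sum((edu == k) & (gender == g)) for g in (0, 1)]
                  for k in range(4)])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

print(r, p_corr)
print(t_stat, p_t)
print(f_stat, p_f)
print(chi2, p_chi, dof)
```

In each case a small p-value suggests that the two variables are related. Which test applies depends only on the types of the two variables.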

This article:

Linear regression: the relationship between one or more continuous variables and one continuous variable. It is divided into simple linear regression and multiple linear regression.

/ 01 / Data Analysis and Data Mining

Database: a tool for storing data. Because Python performs its computation in memory, it struggles to process tens of gigabytes of data, so data cleaning sometimes needs to be done in the database.
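As a rough sketch of cleaning data in the database before it reaches Python, the example below filters out incomplete rows with SQL and loads only the result into pandas. The in-memory SQLite database, the table name `card`, and its rows are all hypothetical stand-ins for a real database.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for a real one
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE card (id INTEGER, income REAL, avg_exp REAL)')
conn.executemany(
    'INSERT INTO card VALUES (?, ?, ?)',
    [(1, 5.0, 700.0), (2, None, 900.0), (3, 8.0, None), (4, 12.0, 1500.0)],
)

# Drop incomplete rows inside the database, so only clean rows are loaded into memory
clean = pd.read_sql_query(
    'SELECT id, income, avg_exp FROM card '
    'WHERE income IS NOT NULL AND avg_exp IS NOT NULL',
    conn,
)
conn.close()

print(clean)
```

With a large table, the same idea keeps the filtering work in the database and transfers only the rows that are actually needed.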

Statistics: Data analysis methods for small data, such as data sampling, descriptive analysis, and result testing.

Artificial Intelligence/Machine Learning/Pattern Recognition: neural network algorithms, which mimic the operation of the human nervous system, can learn from training data and then predict unknown data based on what they have learned.

/ 02 / Regression Equation

01 Simple Linear Regression

Simple linear regression has only one independent variable and one dependent variable. The model

$Y = \beta_0 + \beta_1 X + \varepsilon$

contains three parameters: the "regression coefficient" ($\beta_1$), the "intercept" ($\beta_0$), and the "disturbance term" ($\varepsilon$).

The "disturbance term" is also called the "random error" and follows a normal distribution with a mean of zero.

The difference between the actual value of the dependent variable and the predicted value of the linear regression is called the "residual".

Linear regression is designed to minimize the sum of squared residuals.
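To make "minimize the sum of squared residuals" concrete, here is a minimal sketch using the standard closed-form least squares solution for one independent variable: the slope is the sum of (x - mean x)(y - mean y) divided by the sum of (x - mean x)^2, and the intercept is mean y minus slope times mean x. The five data points are made up for illustration.

```python
import numpy as np

# Made-up data points, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares estimates for a single independent variable
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()


def ssr(b0, b1):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    resid = y - (b0 + b1 * x)
    return float(np.sum(resid ** 2))


best = ssr(intercept, slope)

print(slope, intercept, best)
# Moving either parameter away from the estimate increases the sum of squares
print(best <= ssr(intercept + 0.1, slope))
print(best <= ssr(intercept, slope + 0.1))
```

Any other line through the same points gives a larger sum of squared residuals, which is what makes these estimates the least squares fit.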

Below, a simple linear regression is implemented using a case study from the book.

The goal is to build a model that predicts average monthly credit card spending from income.

import numpy as np

import pandas as pd

import statsmodels.api as sm

import matplotlib.pyplot as plt

from statsmodels.formula.api import ols

# Widen the pandas display so that output is not truncated with ellipses or wrapped

pd.set_option('display.max_columns', 500)

pd.set_option('display.width', 1000)

# Read the data; skipinitialspace=True skips spaces after the delimiter

df = pd.read_csv('creditcard_exp.csv', skipinitialspace=True)

print(df.head())

The data read in is shown below.

Next, perform a correlation analysis of the data.

# Keep the rows that have credit card spending (avg_exp is not NaN)

exp = df[df['avg_exp'].notnull()].copy().iloc[:, 2:].drop('age2', axis=1)

# Keep the rows that have no credit card spending (avg_exp is NaN)

exp_new = df[df['avg_exp'].isnull()].copy().iloc[:, 2:].drop('age2', axis=1)

# Descriptive statistics

print(exp.describe(include='all'))

# Correlation analysis

print(exp[['avg_exp', 'Age', 'Income', 'dist_home_val']].corr(method='pearson'))

The output is as follows.

Income (Income) and average expenditure (avg_exp) are strongly correlated, with a Pearson correlation coefficient of 0.674.

Next, build a model using simple linear regression.

# Create a model using simple linear regression

lm_s = ols('avg_exp ~ Income', data=exp).fit()

print(lm_s.params)

# Output basic model information, the regression coefficients with their tests, and other diagnostic information

print(lm_s.summary())

The coefficients of the simple linear regression are output as follows.

From the above, the regression coefficient is 97.73 and the intercept is 258.05.

The model summary is as follows.

The R-squared value is 0.454 (the square of the 0.674 correlation found earlier, as expected with a single independent variable) and the p-value is close to 0, so the model has some reference value.
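The quantities read from the summary can also be pulled directly from a fitted statsmodels result. The sketch below uses synthetic data, so the frame `demo` and its numbers are assumptions for illustration and will not reproduce the values reported for the credit card data. It also checks that, with one independent variable, R-squared equals the squared Pearson correlation.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(42)

# Synthetic data, hypothetical and for illustration only
demo = pd.DataFrame({'Income': rng.uniform(3, 15, size=100)})
demo['avg_exp'] = 250 + 100 * demo['Income'] + rng.normal(0, 300, size=100)

fit = ols('avg_exp ~ Income', data=demo).fit()

r_squared = fit.rsquared              # share of variance explained by the model
p_income = fit.pvalues['Income']      # p-value of the Income coefficient
corr = demo['avg_exp'].corr(demo['Income'])

print(r_squared, p_income)
# With a single independent variable, R-squared is the squared correlation
print(np.isclose(r_squared, corr ** 2))
```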

The fitted model is then applied to the training data set to obtain its predicted values and residuals.

# predict() returns the predicted values; resid holds the residuals on the training data set

print(pd.DataFrame([lm_s.predict(exp), lm_s.resid], index=['predict', 'resid']).T.head())

The output can be compared with the output when the data is first read.
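A quick way to check the relationship between predictions and residuals is to recompute the residuals by hand as actual value minus predicted value. The small frame below is made up for this sketch and is not the credit card data.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Made-up training data, for illustration only
demo = pd.DataFrame({
    'Income': [4.0, 6.0, 8.0, 10.0, 12.0],
    'avg_exp': [650.0, 900.0, 1000.0, 1300.0, 1400.0],
})
fit = ols('avg_exp ~ Income', data=demo).fit()

predicted = fit.predict(demo)
manual_resid = demo['avg_exp'] - predicted   # residual = actual - predicted

# The hand-computed residuals match the ones stored on the fitted model
print(np.allclose(manual_resid, fit.resid))
# With an intercept in the model, the residuals sum to (numerically) zero
print(abs(fit.resid.sum()) < 1e-6)
```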

Use the model to predict spending for the data set that has no spending records.

# Predict for the rows without spending data

print(lm_s.predict(exp_new)[:5])

The output is as follows.
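To see what such a prediction amounts to, the coefficients reported earlier (intercept 258.05, regression coefficient 97.73) can be applied by hand. The income value of 10 used here is a hypothetical input chosen for illustration, not a row from the data set, and the helper `predict_avg_exp` is not part of statsmodels.

```python
# Coefficients as reported in the article
INTERCEPT = 258.05
SLOPE = 97.73


def predict_avg_exp(income):
    """Predicted average spending for a given income under the fitted line."""
    return INTERCEPT + SLOPE * income


# Hypothetical income value, for illustration only
example_income = 10.0
print(round(predict_avg_exp(example_income), 2))
```

Calling `predict` on the fitted model performs the same calculation for every row of the data passed to it.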

This article was written by the author of the cutting-edge technology. The views represent only the author and do not represent the OFweek position. If you have any infringement or other problems, please contact us.
