HI6007 Statistics Assignment for Business Decisions

Task 1

Use methods of descriptive statistics to summarize the data. Comment on the findings.

The table below presents the descriptive statistics for all three variables: income (in $1000s), household size and amount charged (in $):

Descriptive Statistics	Income	Household Size	Amount Charged
Mean	43.48	3.42	3963.86
Standard Error	2.057785614	0.245930138	132.023387
Median	42	3	4090
Mode	54	2	3890
Standard Deviation	14.55074162	1.738988681	933.5463219
Sample Variance	211.7240816	3.024081633	871508.7351
Kurtosis	-1.247719422	-0.722808552	-0.742482171
Skewness	0.095855639	0.527895977	-0.128860064
Range	46	6	3814
Minimum	21	1	1864
Maximum	67	7	5678
Sum	2174	171	198193
Count	50	50	50

The descriptive statistics summarize a sample of 50 consumers collected by Consumer Research, Inc. to identify the consumer characteristics that predict the amount charged by credit card users. The average annual income is approximately $43,480, the average household size is between 3 and 4 (3.42), and the average amount charged is about $3,964.

Moreover, the amount charged varies considerably across credit card users: its standard deviation is about $934, which gives a coefficient of variation of about 23.6% of the mean. In addition, the maximum amount charged is $5,678 and the minimum amount charged is $1,864.
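The coefficient of variation mentioned above can be checked directly from the table. The following is a minimal sketch that only uses the means and standard deviations reported there; the function names are illustrative and not part of the original analysis.

```python
import math

# Means and standard deviations copied from the descriptive statistics table.
summary = {
    "Income": {"mean": 43.48, "sd": 14.55074162},
    "Household Size": {"mean": 3.42, "sd": 1.738988681},
    "Amount Charged": {"mean": 3963.86, "sd": 933.5463219},
}
n = 50  # sample size (Count row of the table)


def coefficient_of_variation(mean, sd):
    """Return the standard deviation as a percentage of the mean."""
    return 100.0 * sd / mean


def standard_error(sd, n):
    """Return the standard error of the mean, sd / sqrt(n)."""
    return sd / math.sqrt(n)


for name, s in summary.items():
    cv = coefficient_of_variation(s["mean"], s["sd"])
    se = standard_error(s["sd"], n)
    print(f"{name}: CV = {cv:.1f}%, standard error = {se:.4f}")
```

The standard errors computed this way should agree with the Standard Error row of the table, which is a quick consistency check on the Excel output.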

A negative kurtosis value indicates a distribution that is flatter than the normal distribution, with lighter tails (Tang and Zhang, 2013). All three variables have negative kurtosis, so none of them is sharply peaked around the mean. Skewness measures the asymmetry of the values about the mean. Kurtosis and skewness values within ±1 are generally considered acceptable for empirical use.

From the table, the skewness values of all variables lie well within ±1 (0.10, 0.53 and -0.13), so all variables pass the acceptability level for skewness. The kurtosis values for household size (-0.72) and amount charged (-0.74) also lie within ±1, while the kurtosis of income (-1.25) falls slightly outside this range.

The table below shows the correlations between the variables:

Correlation	Income	Household size	Amount charged
Income	1
Household size	0.172533	1
Amount charged	0.630781	0.752853835	1

Amount charged is significantly and positively correlated with both income (r = 0.63) and household size (r = 0.75), while income and household size are only weakly correlated with each other (r = 0.17). The correlation between household size and amount charged is stronger than the correlation between income and amount charged.
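One way to back up the statement about significance is the usual t test for a correlation coefficient, t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom. The sketch below applies it to the three correlations in the table. The critical value 2.011 is an assumption taken from standard t tables (two-tailed, 5% level, 48 degrees of freedom), not a figure from the original analysis.

```python
import math


def correlation_t_stat(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


n = 50  # sample size from the descriptive statistics table
T_CRIT = 2.011  # assumed: approximate two-tailed 5% critical value for 48 df

# Correlations copied from the correlation table.
correlations = {
    ("Income", "Household size"): 0.172533,
    ("Income", "Amount charged"): 0.630781,
    ("Household size", "Amount charged"): 0.752853835,
}

for (var_1, var_2), r in correlations.items():
    t = correlation_t_stat(r, n)
    verdict = "significant" if abs(t) > T_CRIT else "not significant"
    print(f"{var_1} vs {var_2}: r = {r:.3f}, t = {t:.2f} ({verdict})")
```

For a simple linear regression, this t statistic is the same as the t Stat reported for the slope, so the two correlations with amount charged can be cross-checked against the regression output.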

Develop estimated regression equations, first using annual income as the independent variable and then using household size as the independent variable. Which variable is the better predictor of annual credit card charges? Discuss your findings.

Regression analysis for annual income and credit card charges:

SUMMARY OUTPUT

Regression Statistics
Multiple R	0.630780826
R Square	0.39788445
Adjusted R Square	0.385340376
Standard Error	731.902474
Observations	50

ANOVA
	df	SS	MS	F	Significance F
Regression	1	16991228.91	16991228.91	31.71891773	9.10311E-07
Residual	48	25712699.11	535681.2315
Total	49	42703928.02

	Coefficients	Standard Error	t Stat	P-value	Lower 95%	Upper 95%	Lower 95.0%	Upper 95.0%
Intercept	2204.240517	329.1340306	6.697090887	2.14344E-08	1542.472207	2866.009	1542.472	2866.009
X Variable 1	40.46962932	7.185715961	5.631955054	9.10311E-07	26.02177931	54.91748	26.02178	54.91748

Regression equation:

Y= m X+ c

Y= 40.47X + 2204

Here,

X = Annual income (in $1000s)

Y = Annual credit card charges

Regression analysis for household size and credit card charges:

SUMMARY OUTPUT

Regression Statistics
Multiple R	0.752853835
R Square	0.566788897
Adjusted R Square	0.557763666
Standard Error	620.8162594
Observations	50

ANOVA
	df	SS	MS	F	Significance F
Regression	1	24204112.28	24204112.28	62.80048437	2.86236E-10
Residual	48	18499815.74	385412.8279
Total	49	42703928.02

	Coefficients	Standard Error	t Stat	P-value	Lower 95%	Upper 95%	Lower 95.0%	Upper 95.0%
Intercept	2581.644082	195.269886	13.22090228	1.287E-17	2189.027669	2974.26	2189.028	2974.26
X Variable 1	404.1567013	50.99977822	7.924675664	2.86236E-10	301.6147764	506.6986	301.6148	506.6986

Regression equation:

Y= m X+ c

Y= 404.16X + 2581.64

Here,

X = Household Size

Y = Annual credit card charges
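Both simple regression equations can be reproduced from the summary tables alone, since the least-squares slope equals r·(s_y / s_x) and the intercept equals ȳ − slope·x̄. The following is a minimal sketch of that check; the function name is illustrative, and the inputs are the means, standard deviations and correlations reported earlier.

```python
def regression_from_summary(r, mean_x, sd_x, mean_y, sd_y):
    """Return (slope, intercept) of the least-squares line of y on x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Amount charged (dependent variable): mean and standard deviation.
MEAN_Y, SD_Y = 3963.86, 933.5463219

# Income (in $1000s) as the independent variable.
income_line = regression_from_summary(0.630780826, 43.48, 14.55074162, MEAN_Y, SD_Y)

# Household size as the independent variable.
household_line = regression_from_summary(0.752853835, 3.42, 1.738988681, MEAN_Y, SD_Y)

print("Income:         Y = %.2fX + %.2f" % income_line)
print("Household size: Y = %.2fX + %.2f" % household_line)
```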

From the above regression analysis, the p-value is 9.10 × 10⁻⁷ and R² is approximately 0.40 with income as the independent variable.

This means that about 40% of the variation in amount charged can be explained by annual income. With household size as the independent variable, the p-value is 2.86 × 10⁻¹⁰ and R² is approximately 0.57, so household size explains about 57% of the variation in amount charged. Both relationships are statistically significant at the 5% level.

Household size is the better predictor of annual credit card charges because it explains about 57% of the variation in amount charged, against only about 40% for annual income.
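The R² and F values used in this comparison follow directly from the sums of squares in the two ANOVA tables: R² = SS regression / SS total, and F = MS regression / MS residual. A short sketch of that arithmetic, with illustrative function names, is given below.

```python
def r_squared(ss_regression, ss_total):
    """Share of the total variation explained by the regression."""
    return ss_regression / ss_total


def f_statistic(ss_regression, df_regression, ss_residual, df_residual):
    """Ratio of the regression mean square to the residual mean square."""
    return (ss_regression / df_regression) / (ss_residual / df_residual)


SS_TOTAL = 42703928.02  # Total row, identical in both ANOVA tables

# Regression and residual sums of squares copied from the ANOVA tables.
models = {
    "Income": {"ss_reg": 16991228.91, "ss_res": 25712699.11},
    "Household size": {"ss_reg": 24204112.28, "ss_res": 18499815.74},
}

for name, m in models.items():
    r2 = r_squared(m["ss_reg"], SS_TOTAL)
    f = f_statistic(m["ss_reg"], 1, m["ss_res"], 48)
    print(f"{name}: R squared = {r2:.4f}, F = {f:.2f}")
```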

Develop an estimated regression equation with annual income and household size as the independent variables. Discuss your findings.

SUMMARY OUTPUT

Regression Statistics
Multiple R	0.908501824
R Square	0.825375565
Adjusted R Square	0.817944738
Standard Error	398.3249315
Observations	50

ANOVA
	df	SS	MS	F	Significance F
Regression	2	35246778.72	17623389.36	111.0745228	1.54692E-18
Residual	47	7457149.298	158662.751
Total	49	42703928.02

	Coefficients	Standard Error	t Stat	P-value	Lower 95%	Upper 95%	Lower 95.0%	Upper 95.0%
Intercept	1305.033885	197.770988	6.598712469	3.32392E-08	907.1699825	1702.898	907.17	1702.898
X Variable 1	33.12195539	3.970237444	8.342562845	7.88598E-11	25.13486801	41.10904	25.13487	41.10904
X Variable 2	356.3402032	33.22039979	10.72654771	3.17247E-14	289.5093801	423.171	289.5094	423.171

Y=m₁ X₁+ m₂X₂ +C

Y= 33.12 X₁ + 356.34 X₂+ 1305.03

Here,

Y= Amount Charged

X₁ = Income

X₂ = Household Size

Based on the above regression analysis, R² is 0.8254, which means that about 82.54% of the variation in amount charged can be explained by household size and income together.

Both independent variables are highly significant in the combined model (p = 7.89 × 10⁻¹¹ for income and p = 3.17 × 10⁻¹⁴ for household size). The standard error of the estimate falls to 398.32, compared with 731.90 and 620.82 for the two simple linear regression models, which shows that the combined model fits the data better.
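To compare the three models on an equal footing, the adjusted R² and the standard error of the estimate can be recomputed from the regression outputs, using adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1) and standard error = √(SS residual / (n − k − 1)), where k is the number of independent variables. The sketch below uses only figures from the three SUMMARY OUTPUT tables; the function names are illustrative.

```python
import math


def adjusted_r_squared(r2, n, k):
    """Adjust R squared for the number of independent variables k."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def standard_error_of_estimate(ss_residual, n, k):
    """Square root of the residual mean square."""
    return math.sqrt(ss_residual / (n - k - 1))


n = 50  # observations in every model

# (label, R squared, residual sum of squares, number of independent variables)
models = [
    ("Income only", 0.39788445, 25712699.11, 1),
    ("Household size only", 0.566788897, 18499815.74, 1),
    ("Income and household size", 0.825375565, 7457149.298, 2),
]

for label, r2, ss_res, k in models:
    adj = adjusted_r_squared(r2, n, k)
    se = standard_error_of_estimate(ss_res, n, k)
    print(f"{label}: adjusted R squared = {adj:.4f}, standard error = {se:.2f}")
```

Adjusted R² is the fairer basis for comparison here because it penalizes the extra independent variable in the combined model.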

What is the predicted annual credit card charge for a three-person household with an annual income of $40,000?

It can be calculated from the regression equation obtained above with both income and household size as independent variables, where income is expressed in $1000s:

Y= 33.12 X₁ + 356.34 X₂+ 1305.03

Y= 33.12 *40 + 356.34 *3 + 1305.03

= 1324.8 + 1069.02 + 1305.03

= $3698.85

The predicted annual credit card charge for a three-person household with an annual income of $40,000 is about $3,699.
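The same calculation can be wrapped in a small function so that other households can be evaluated quickly. This is a minimal sketch using the rounded coefficients from the equation above; the function and parameter names are illustrative, and income must be given in $1000s.

```python
def predict_amount_charged(income_thousands, household_size,
                           b_income=33.12, b_household=356.34,
                           intercept=1305.03):
    """Predicted annual credit card charge from the two-variable model."""
    return b_income * income_thousands + b_household * household_size + intercept


# Three-person household with an annual income of $40,000.
prediction = predict_amount_charged(40, 3)
print(f"Predicted annual charge: ${prediction:,.2f}")
```

Predictions are most trustworthy inside the range of the sample, that is, incomes between $21,000 and $67,000 and household sizes between 1 and 7.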

Discuss the need for other independent variables that could be added to the model. What additional variables might be helpful?

The following independent variables could be added to the model, because the standard error of the estimate in the regression model is still large:

Number of credit cards: The number of credit cards a consumer holds provides measurable information about the consumer. A higher number of credit cards can be expected to lead to a higher amount charged. However, the number of cards may have a low correlation with income and household size (Little and Rubin, 2014).

Purchasing options: This variable captures consumers' preferences for purchasing with cash or by credit card. It can reveal the buying pattern of the consumer based on security concerns and cultural factors.

Average age and gender ratio of a household: This demographic information can help to improve the existing model. It is believed that young people and women purchase more than men, including by credit card, so this variable can help to refine the model and provide more accurate results. In addition, both items are easy to collect and are independent of the other variables.

Task 2

Activity 1:

Activity 2:

(A)

(B)

Activity 3:

(A)

Correlations between ten different pairs of variables were calculated to evaluate the marks of students in different assessments. The pairs and their correlations are shown below:

Correlation

Variable 1	Variable 2	Correlation
HI001 FINAL EXAM	HI002 FINAL EXAM	5%
HI001 ASSIGNMENT 01	HI001 ASSIGNMENT 02	66%
HI002 ASSIGNMENT 01	HI002 ASSIGNMENT 02	55%
HI003 ASSIGNMENT 01	HI003 ASSIGNMENT 02	52%
HI001 FINAL EXAM	HI003 FINAL EXAM	12%
HI002 FINAL EXAM	HI003 FINAL EXAM	12%
HI001 ASSIGNMENT 01	HI002 ASSIGNMENT 01	-13%
HI001 ASSIGNMENT 02	HI002 ASSIGNMENT 02	-4%
HI002 ASSIGNMENT 01	HI003 ASSIGNMENT 01	-23%
HI002 ASSIGNMENT 02	HI003 ASSIGNMENT 02	-11%

(B)

Positive or Negative Correlated

Variable 1	Variable 2	Correlated
HI001 FINAL EXAM	HI002 FINAL EXAM	Positive
HI001 ASSIGNMENT 01	HI001 ASSIGNMENT 02	Positive
HI002 ASSIGNMENT 01	HI002 ASSIGNMENT 02	Positive
HI003 ASSIGNMENT 01	HI003 ASSIGNMENT 02	Positive
HI001 FINAL EXAM	HI003 FINAL EXAM	Positive
HI002 FINAL EXAM	HI003 FINAL EXAM	Positive
HI001 ASSIGNMENT 01	HI002 ASSIGNMENT 01	Negative
HI001 ASSIGNMENT 02	HI002 ASSIGNMENT 02	Negative
HI002 ASSIGNMENT 01	HI003 ASSIGNMENT 01	Negative
HI002 ASSIGNMENT 02	HI003 ASSIGNMENT 02	Negative

Weak or Strong Correlation

Variable 1	Variable 2	Correlation
HI001 FINAL EXAM	HI002 FINAL EXAM	Weak
HI001 ASSIGNMENT 01	HI001 ASSIGNMENT 02	Strong
HI002 ASSIGNMENT 01	HI002 ASSIGNMENT 02	Strong
HI003 ASSIGNMENT 01	HI003 ASSIGNMENT 02	Strong
HI001 FINAL EXAM	HI003 FINAL EXAM	Weak
HI002 FINAL EXAM	HI003 FINAL EXAM	Weak
HI001 ASSIGNMENT 01	HI002 ASSIGNMENT 01	Weak
HI001 ASSIGNMENT 02	HI002 ASSIGNMENT 02	Weak
HI002 ASSIGNMENT 01	HI003 ASSIGNMENT 01	Weak
HI002 ASSIGNMENT 02	HI003 ASSIGNMENT 02	Weak
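The direction and strength labels in the two tables above can be derived mechanically from the correlation values. The sketch below does this with a cut-off of 0.5 for "Strong"; that threshold is an assumption chosen because it reproduces the labels in the tables, since the original analysis does not state the rule it used.

```python
# Correlations copied from the correlation table (as proportions).
correlations = [
    ("HI001 FINAL EXAM", "HI002 FINAL EXAM", 0.05),
    ("HI001 ASSIGNMENT 01", "HI001 ASSIGNMENT 02", 0.66),
    ("HI002 ASSIGNMENT 01", "HI002 ASSIGNMENT 02", 0.55),
    ("HI003 ASSIGNMENT 01", "HI003 ASSIGNMENT 02", 0.52),
    ("HI001 FINAL EXAM", "HI003 FINAL EXAM", 0.12),
    ("HI002 FINAL EXAM", "HI003 FINAL EXAM", 0.12),
    ("HI001 ASSIGNMENT 01", "HI002 ASSIGNMENT 01", -0.13),
    ("HI001 ASSIGNMENT 02", "HI002 ASSIGNMENT 02", -0.04),
    ("HI002 ASSIGNMENT 01", "HI003 ASSIGNMENT 01", -0.23),
    ("HI002 ASSIGNMENT 02", "HI003 ASSIGNMENT 02", -0.11),
]


def direction(r):
    """Label the sign of a correlation coefficient."""
    if r > 0:
        return "Positive"
    if r < 0:
        return "Negative"
    return "None"


def strength(r, threshold=0.5):
    """Label a correlation as Strong or Weak (threshold is an assumption)."""
    return "Strong" if abs(r) >= threshold else "Weak"


for var_1, var_2, r in correlations:
    print(f"{var_1}\t{var_2}\t{direction(r)}\t{strength(r)}")
```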

Significance Value

Variable 1	Variable 2	Significance Value
HI001 FINAL EXAM	HI002 FINAL EXAM	Low
HI001 ASSIGNMENT 01	HI001 ASSIGNMENT 02	High
HI002 ASSIGNMENT 01	HI002 ASSIGNMENT 02	High
HI003 ASSIGNMENT 01	HI003 ASSIGNMENT 02	High
HI001 FINAL EXAM	HI003 FINAL EXAM	Low
HI002 FINAL EXAM	HI003 FINAL EXAM	Low
HI001 ASSIGNMENT 01	HI002 ASSIGNMENT 01	Low
HI001 ASSIGNMENT 02	HI002 ASSIGNMENT 02	Low
HI002 ASSIGNMENT 01	HI003 ASSIGNMENT 01	Low
HI002 ASSIGNMENT 02	HI003 ASSIGNMENT 02	Low

The tables above show that the significance value plays a vital role in establishing the relationship between different variables. A high value supports a firm conclusion and helps in making sound decisions for different activities (Cohen, et al., 2013).

In addition, the data show both positive and negative relationships between the variables. A positive correlation with a high significance value indicates that students who score well in one assessment tend to score well in the other. Moreover, the strength of the correlation, weak or strong, determines the significance value (Guiso, et al., 2015). The strong correlations are all between the two assignments of the same unit, which indicates that students perform consistently across assignments within a unit.

Task 3

Use descriptive statistics to summarize the data from the two studies. What are your preliminary observations about the depression scores?

Medical Study 1

Descriptive Statistics	Florida	New York	North Carolina
Mean	5.55	8	7.05
Standard Error	0.478347	0.492042	0.634428877
Median	6	8	7.5
Mode	7	8	8
Standard Deviation	2.139233	2.200478	2.837252192
Sample Variance	4.576316	4.842105	8.05
Kurtosis	-1.06219	0.626432	-0.904925496
Skewness	-0.27356	0.625687	-0.056188269
Range	7	9	9
Minimum	2	4	3
Maximum	9	13	12
Sum	111	160	141
Count	20	20	20

In Medical Study 1, people in New York have the highest mean depression score (8.0), compared with Florida (5.55) and North Carolina (7.05).

Medical Study 2

Descriptive Statistics	Florida	New York	North Carolina
Mean	14.5	15.25	13.95
Standard Error	0.708965146	0.923024206	0.65884668
Median	14.5	14.5	14
Mode	17	14	12
Standard Deviation	3.170588522	4.127889737	2.946451925
Sample Variance	10.05263158	17.03947368	8.681578947
Kurtosis	-0.340799481	-0.0301367	-0.592052134
Skewness	0.280721497	0.525352494	-0.041733773
Range	12	15	11
Minimum	9	9	8
Maximum	21	24	19
Sum	290	305	279
Count	20	20	20

In Medical Study 2, individuals with a chronic health condition such as arthritis, hypertension, and/or heart ailment have similar, high depression scores in all three locations (means of 14.5, 15.25 and 13.95). When the two studies are compared, individuals 65 years of age or older with chronic health conditions have higher depression scores than healthy individuals (Tang and Zhang, 2013).

Use analysis of variance on both data sets. State the hypotheses being tested in each case. What are your conclusions?

Medical Study 1:

Hypothesis Formulation:

H₀: µ1 = µ2 = µ3

H₀ indicates no significant difference in the mean depression score of healthy people in the three locations.

Ha: Not all of µ1, µ2 and µ3 are equal

Ha indicates that the mean depression score of healthy people differs significantly in at least one of the three locations.

Here,

µ1= the mean depression score of healthy people in Florida

µ2= the mean depression score of healthy people in New York

µ3= the mean depression score of healthy people in North Carolina

Rejection Rule: The null hypothesis is rejected if the calculated value of the F statistic ≥ the F critical value (equivalently, if the p-value ≤ 0.05).

ANOVA Single Factor:

ANOVA
Source of Variation	SS	df	MS	F	P-value	F crit
Between Groups	61.03333	2	30.51666667	5.240886	0.00814	3.158842719
Within Groups	331.9	57	5.822807018

Total	392.9333	59

Conclusion:

Here, F (5.24) is greater than Fcrit (3.16) and the p-value (0.008) is less than 0.05, so the null hypothesis is rejected. The sample provides enough evidence to support the claim that there is a significant difference in the mean depression score of healthy people in the three locations.

Medical Study 2:

ANOVA Single Factor:

ANOVA
Source of Variation	SS	df	MS	F	P-value	F crit
Between Groups	17.03333333	2	8.516667	0.714212	0.493906	3.158843
Within Groups	679.7	57	11.92456

Total	696.7333333	59

Conclusion:

Here, F (0.714) is less than Fcrit (3.16) and the p-value (0.494) is greater than 0.05, so the null hypothesis is not rejected. The sample does not provide enough evidence to conclude that there is a significant difference in the mean depression score of individuals having a chronic health condition such as arthritis, hypertension, and/or heart ailment in the three locations (Shipley, 2016).
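Because every group has the same size, both ANOVA tables can be reproduced from the descriptive statistics alone: the between-groups sum of squares comes from the group means, and the within-groups sum of squares from the sample variances. The following is a minimal sketch of that check with an illustrative function name; it computes F only, so the p-values and F critical values still come from the Excel output.

```python
def one_way_anova_from_summary(groups):
    """groups: list of (n, mean, sample_variance) tuples.

    Returns (ss_between, ss_within, f_statistic).
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    ss_within = sum((n - 1) * var for n, _, var in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f


# (Count, Mean, Sample Variance) for Florida, New York, North Carolina.
study_1 = [(20, 5.55, 4.576316), (20, 8.0, 4.842105), (20, 7.05, 8.05)]
study_2 = [(20, 14.5, 10.05263158), (20, 15.25, 17.03947368),
           (20, 13.95, 8.681578947)]

for label, groups in (("Medical Study 1", study_1), ("Medical Study 2", study_2)):
    ss_b, ss_w, f = one_way_anova_from_summary(groups)
    print(f"{label}: SS between = {ss_b:.4f}, SS within = {ss_w:.4f}, F = {f:.4f}")
```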

Use inferences about individual treatment means where appropriate. What are your conclusions?

From the results, it can be inferred that in Medical Study 1 the mean depression score is related to geographical location, as it is not the same in all three locations. In addition, people from New York have a higher mean depression score than individuals from the other two locations.

In Medical Study 2, it can be inferred that the mean depression score of individuals 65 years of age or older having a chronic health condition is not linked to location. The mean depression scores are similar in all three geographical locations.
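For Medical Study 1, where the ANOVA rejects the null hypothesis, one common way to compare individual treatment means is Fisher's least significant difference (LSD): two means are declared different when their absolute difference exceeds t·√(MSE·(1/nᵢ + 1/nⱼ)). The sketch below applies this to the Study 1 means. It is not part of the original analysis, and the critical value 2.0025 is an assumption taken from standard t tables (two-tailed, 5% level, 57 degrees of freedom).

```python
import math
from itertools import combinations

# Group means and within-groups mean square from Medical Study 1.
means = {"Florida": 5.55, "New York": 8.0, "North Carolina": 7.05}
N_PER_GROUP = 20
MSE = 5.822807018
T_CRIT = 2.0025  # assumed: approximate two-tailed 5% critical t for 57 df


def fisher_lsd(mse, n_i, n_j, t_crit):
    """Least significant difference between two group means."""
    return t_crit * math.sqrt(mse * (1.0 / n_i + 1.0 / n_j))


lsd = fisher_lsd(MSE, N_PER_GROUP, N_PER_GROUP, T_CRIT)

# True when the absolute difference between the two means exceeds the LSD.
results = {}
for a, b in combinations(means, 2):
    difference = abs(means[a] - means[b])
    results[(a, b)] = difference > lsd
    print(f"{a} vs {b}: |difference| = {difference:.2f}, "
          f"LSD = {lsd:.3f}, differs = {results[(a, b)]}")
```

Under that assumed critical value, only the Florida and New York means differ by more than the LSD. The Florida and North Carolina difference (1.50) falls just below it, so that comparison is sensitive to the exact critical value used.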

References

Cohen, J., Cohen, P., West, S. G. and Aiken, L. S. (2013) Applied multiple regression/correlation analysis for the behavioral sciences. UK: Routledge.

Guiso, L., Sapienza, P. and Zingales, L. (2015) The value of corporate culture. Journal of Financial Economics, 117(1), pp. 60-76.

Little, R. J., and Rubin, D. B. (2014) Statistical analysis with missing data. USA: John Wiley & Sons.

Shipley, B. (2016) Cause and Correlation in Biology: A User’s Guide to Path Analysis, Structural Equations and Causal Inference with R. Cambridge University Press.

Tang, Q. Y., and Zhang, C. X. (2013) Data Processing System (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research. Insect Science, 20(2), pp. 254-260.