# HI6007 Statistics Assignment for Business Decisions

1. Use methods of descriptive statistics to summarize the data. Comment on the findings.

The table below presents the descriptive statistics for all three variables: income, household size, and amount charged.

| Descriptive Statistics | Income (\$1000s) | Household Size | Amount Charged (\$) |
|---|---|---|---|
| Mean | 43.48 | 3.42 | 3963.86 |
| Standard Error | 2.057785614 | 0.245930138 | 132.023387 |
| Median | 42 | 3 | 4090 |
| Mode | 54 | 2 | 3890 |
| Standard Deviation | 14.55074162 | 1.738988681 | 933.5463219 |
| Sample Variance | 211.7240816 | 3.024081633 | 871508.7351 |
| Kurtosis | -1.247719422 | -0.722808552 | -0.742482171 |
| Skewness | 0.095855639 | 0.527895977 | -0.128860064 |
| Range | 46 | 6 | 3814 |
| Minimum | 21 | 1 | 1864 |
| Maximum | 67 | 7 | 5678 |
| Sum | 2174 | 171 | 198193 |
| Count | 50 | 50 | 50 |

Consumer Research, Inc. took a sample of 50 consumers to determine which consumer characteristics can be used to predict the amount charged by credit card users. The average income is approximately \$43,480, the average household size is 3.42 (between 3 and 4), and the average amount charged by the credit card users is \$3,964.

Moreover, the amount charged varies considerably between credit card users: the standard deviation is \$933.55, which gives a coefficient of variation of about 24%. The maximum amount charged is \$5,678 and the minimum amount charged is \$1,864.
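To put the spread of each variable on a comparable scale, the coefficient of variation (standard deviation divided by mean) can be computed directly from the table. The following is a minimal sketch that uses only the summary figures reported above; the dictionary layout and the function name are illustrative choices, not part of the original analysis.

```python
# Means and standard deviations copied from the descriptive statistics table.
summary = {
    "Income": {"mean": 43.48, "sd": 14.55074162},
    "Household Size": {"mean": 3.42, "sd": 1.738988681},
    "Amount Charged": {"mean": 3963.86, "sd": 933.5463219},
}


def coefficient_of_variation(mean, sd):
    """Return the standard deviation as a percentage of the mean."""
    return 100.0 * sd / mean


cv = {
    name: coefficient_of_variation(stats["mean"], stats["sd"])
    for name, stats in summary.items()
}

for name, value in cv.items():
    print(f"{name}: CV = {value:.1f}%")
```

With these figures the coefficient of variation comes out at roughly 33% for income, 51% for household size and 24% for amount charged.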

A negative kurtosis value indicates that the distribution is flatter than a normal distribution, with lighter tails (Tang and Zhang, 2013). A kurtosis value within ±1 is generally considered suitable for empirical use. Here the kurtosis of household size (-0.72) and amount charged (-0.74) lies within this range, while that of income (-1.25) falls slightly outside it. Skewness measures the asymmetry of the values about the mean, and a skewness value within ±1 is generally considered acceptable.

From the table, the skewness values for all three variables (0.10, 0.53 and -0.13) lie within ±1, so all the variables meet this acceptability level for empirical use.

The table below shows the correlations between the variables:

| Correlation | Income | Household size | Amount charged |
|---|---|---|---|
| Income | 1 | | |
| Household size | 0.172533 | 1 | |
| Amount charged | 0.630781 | 0.752853835 | 1 |

There is a significant correlation between amount charged and income (0.63) and between amount charged and household size (0.75). However, the correlation between household size and amount charged is stronger than the correlation between income and amount charged.
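Whether a sample correlation differs significantly from zero can be checked with the statistic t = r√(n − 2) / √(1 − r²), which has n − 2 degrees of freedom. The sketch below applies this standard formula to the two correlations with amount charged; the function and variable names are illustrative.

```python
import math

n = 50  # sample size from the descriptive statistics table

# Pearson correlations with amount charged, from the correlation table.
correlations = {"Income": 0.630781, "Household size": 0.752853835}


def correlation_t_stat(r, n):
    """t statistic for H0: population correlation = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


t_stats = {name: correlation_t_stat(r, n) for name, r in correlations.items()}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.3f} on {n - 2} degrees of freedom")
```

Both statistics are large (about 5.63 and 7.92 on 48 degrees of freedom), which is consistent with describing both correlations as significant.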

1. Develop estimated regression equations, first using annual income as the independent variable and then using household size as the independent variable. Which variable is the better predictor of annual credit card charges? Discuss your findings.

Regression analysis for annual income and credit card charges:

SUMMARY OUTPUT

| Regression Statistics | Value |
|---|---|
| Multiple R | 0.630780826 |
| R Square | 0.39788445 |
| Adjusted R Square | 0.385340376 |
| Standard Error | 731.902474 |
| Observations | 50 |

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 16991228.91 | 16991228.91 | 31.71891773 | 9.10311E-07 |
| Residual | 48 | 25712699.11 | 535681.2315 | | |
| Total | 49 | 42703928.02 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 2204.240517 | 329.1340306 | 6.697090887 | 2.14344E-08 | 1542.472207 | 2866.009 | 1542.472 | 2866.009 |
| X Variable 1 (Income) | 40.46962932 | 7.185715961 | 5.631955054 | 9.10311E-07 | 26.02177931 | 54.91748 | 26.02178 | 54.91748 |

Regression equation:

Y= m X+ c

Y= 40.47X + 2204.24

Here,

X = Annual income

Y = Annual credit card charges

Regression analysis for household size and credit card charges:

SUMMARY OUTPUT

| Regression Statistics | Value |
|---|---|
| Multiple R | 0.752853835 |
| R Square | 0.566788897 |
| Adjusted R Square | 0.557763666 |
| Standard Error | 620.8162594 |
| Observations | 50 |

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 24204112.28 | 24204112.28 | 62.80048437 | 2.86236E-10 |
| Residual | 48 | 18499815.74 | 385412.8279 | | |
| Total | 49 | 42703928.02 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 2581.644082 | 195.269886 | 13.22090228 | 1.287E-17 | 2189.027669 | 2974.26 | 2189.028 | 2974.26 |
| X Variable 1 (Household Size) | 404.1567013 | 50.99977822 | 7.924675664 | 2.86236E-10 | 301.6147764 | 506.6986 | 301.6148 | 506.6986 |

Regression equation:

Y= m X+ c

Y= 404.16X + 2581.64

Here,

X = Household Size

Y = Annual credit card charges
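In simple linear regression the slope equals r·(sY / sX) and the intercept equals Ȳ − slope·X̄, so both equations above can be cross-checked from the descriptive statistics and correlations alone. The following is a minimal sketch of that check; the variable and function names are illustrative, and the inputs are the figures reported in the tables.

```python
# Means and standard deviations from the descriptive statistics table.
mean_income, sd_income = 43.48, 14.55074162
mean_size, sd_size = 3.42, 1.738988681
mean_charged, sd_charged = 3963.86, 933.5463219

# Correlations with amount charged (the "Multiple R" of each simple regression).
r_income, r_size = 0.630780826, 0.752853835


def simple_regression(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) of the least-squares line of y on x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


income_line = simple_regression(mean_income, sd_income, mean_charged, sd_charged, r_income)
size_line = simple_regression(mean_size, sd_size, mean_charged, sd_charged, r_size)

print(f"Income:         Y = {income_line[0]:.2f}X + {income_line[1]:.2f}")
print(f"Household size: Y = {size_line[0]:.2f}X + {size_line[1]:.2f}")
```

The values obtained this way agree with the coefficients in the two regression outputs to the precision shown.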

From the above regression analysis, the p-value is 9.10 × 10⁻⁷ and R² is approximately 0.40 when income is the independent variable.

This means that about 40% of the variation in amount charged can be explained by annual income. On the other hand, the p-value is 2.86 × 10⁻¹⁰ and R² is approximately 0.57 when household size is the independent variable. This means that household size explains about 57% of the variation in amount charged.

Household size is a better predictor of annual credit card charges than income because it explains about 57% of the variation in amount charged, against only about 40% for annual income.

1. Develop an estimated regression equation with annual income and household size as the independent variables. Discuss your findings.

SUMMARY OUTPUT

| Regression Statistics | Value |
|---|---|
| Multiple R | 0.908501824 |
| R Square | 0.825375565 |
| Adjusted R Square | 0.817944738 |
| Standard Error | 398.3249315 |
| Observations | 50 |

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 2 | 35246778.72 | 17623389.36 | 111.0745228 | 1.54692E-18 |
| Residual | 47 | 7457149.298 | 158662.751 | | |
| Total | 49 | 42703928.02 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 1305.033885 | 197.770988 | 6.598712469 | 3.32392E-08 | 907.1699825 | 1702.898 | 907.17 | 1702.898 |
| X Variable 1 (Income) | 33.12195539 | 3.970237444 | 8.342562845 | 7.88598E-11 | 25.13486801 | 41.10904 | 25.13487 | 41.10904 |
| X Variable 2 (Household Size) | 356.3402032 | 33.22039979 | 10.72654771 | 3.17247E-14 | 289.5093801 | 423.171 | 289.5094 | 423.171 |

Y=m1 X1 + m2X2 +C

Y= 33.12 X1 + 356.34 X2 + 1305.03

Here,

Y= Amount Charged

X1 = Income

X2 = Household Size

Based on the above regression analysis, R² is 0.8254, which means that about 82.54% of the variation in amount charged can be explained by household size and income together.

Both independent variables remain highly significant in the combined model, with p-values of 7.89 × 10⁻¹¹ for income and 3.17 × 10⁻¹⁴ for household size. The standard error of the estimate falls to 398.32, compared with 731.90 and 620.82 in the two simple linear regression models, which shows an improvement in the regression model.
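Because R² never decreases when a predictor is added, a fairer comparison of the three models uses adjusted R², defined as 1 − (1 − R²)(n − 1)/(n − k − 1) for n observations and k predictors. The sketch below recomputes it from the R Square values in the three regression outputs; the model labels and function name are illustrative.

```python
N_OBSERVATIONS = 50


def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)


# (R Square, number of predictors) taken from the regression outputs above.
models = {
    "Income only": (0.39788445, 1),
    "Household size only": (0.566788897, 1),
    "Income and household size": (0.825375565, 2),
}

adjusted = {
    name: adjusted_r_squared(r2, N_OBSERVATIONS, k)
    for name, (r2, k) in models.items()
}

for name, value in adjusted.items():
    print(f"{name}: adjusted R^2 = {value:.4f}")
```

The two-variable model keeps the highest adjusted R² (about 0.818, against about 0.385 and 0.558), so the improvement is not simply a result of adding a predictor.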

1. What is the predicted annual credit card charge for a three-person household with an annual income of \$40,000?

It can be calculated using the equation obtained from the regression analysis with both independent variables, with income expressed in thousands of dollars:

Y= 33.12 X1 + 356.34 X2 + 1305.03

Y= 33.12 *40 + 356.34 *3 + 1305.03

= 1324.8 + 1069.02 + 1305.03

= \$3698.85

The predicted annual credit card charge for a three-person household with an annual income of \$40,000 is about \$3,699.
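The same prediction can be wrapped in a small function so that other household profiles can be evaluated. This is a sketch only: the function name is hypothetical, income is assumed to be entered in thousands of dollars as in the calculation above, and the coefficients are the unrounded values from the regression output.

```python
# Unrounded coefficients from the two-variable regression output.
INTERCEPT = 1305.033885
COEF_INCOME = 33.12195539   # per 1,000 dollars of annual income
COEF_SIZE = 356.3402032     # per household member


def predict_charge(income_thousands, household_size):
    """Predicted annual credit card charge in dollars."""
    return INTERCEPT + COEF_INCOME * income_thousands + COEF_SIZE * household_size


prediction = predict_charge(income_thousands=40, household_size=3)
print(f"Predicted annual charge: {prediction:.2f} dollars")
```

Using the unrounded coefficients gives about 3,698.93 rather than 3,698.85; both round to the same figure of about 3,699 dollars.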

1. Discuss the need for other independent variables that could be added to the model. What additional variables might be helpful?

The following independent variables could be added to the model because the standard error of the estimate in the regression model is still large:

Number of credit cards: The number of credit cards held can provide valuable information about the consumer in measurable form. A higher number of credit cards is likely to be associated with a higher amount charged. However, the number of cards may have a low correlation with income and household size (Little and Rubin, 2014).

Purchasing options: This variable can provide information about consumers' preferences for purchasing with cash or with a credit card. It can reveal the consumer's buying pattern, which is shaped by security concerns and cultural factors.

Average age and gender ratio of a household: This demographic information about the consumer can help to improve the existing estimation model. It is believed that young people and females purchase more than males, including by credit card, so these variables can help to refine the model and provide more accurate results. In addition, both are easy to collect and are independent of the other variables.

Activity 1:

Activity 2:

(A)

(B)

Activity 3:

(A)

Correlations were calculated for ten different pairs of variables to evaluate the marks of students in different assessments. The pairs of variables and their correlations are shown below:

Correlation

| Variable 1 | Variable 2 | Correlation |
|---|---|---|
| HI001 FINAL EXAM | HI002 FINAL EXAM | 5% |
| HI001 ASSIGNMENT 01 | HI001 ASSIGNMENT 02 | 66% |
| HI002 ASSIGNMENT 01 | HI002 ASSIGNMENT 02 | 55% |
| HI003 ASSIGNMENT 01 | HI003 ASSIGNMENT 02 | 52% |
| HI001 FINAL EXAM | HI003 FINAL EXAM | 12% |
| HI002 FINAL EXAM | HI003 FINAL EXAM | 12% |
| HI001 ASSIGNMENT 01 | HI002 ASSIGNMENT 01 | -13% |
| HI001 ASSIGNMENT 02 | HI002 ASSIGNMENT 02 | -4% |
| HI002 ASSIGNMENT 01 | HI003 ASSIGNMENT 01 | -23% |
| HI002 ASSIGNMENT 02 | HI003 ASSIGNMENT 02 | -11% |

(B)

1. Positive or Negative Correlated

| Variable 1 | Variable 2 | Correlated |
|---|---|---|
| HI001 FINAL EXAM | HI002 FINAL EXAM | Positive |
| HI001 ASSIGNMENT 01 | HI001 ASSIGNMENT 02 | Positive |
| HI002 ASSIGNMENT 01 | HI002 ASSIGNMENT 02 | Positive |
| HI003 ASSIGNMENT 01 | HI003 ASSIGNMENT 02 | Positive |
| HI001 FINAL EXAM | HI003 FINAL EXAM | Positive |
| HI002 FINAL EXAM | HI003 FINAL EXAM | Positive |
| HI001 ASSIGNMENT 01 | HI002 ASSIGNMENT 01 | Negative |
| HI001 ASSIGNMENT 02 | HI002 ASSIGNMENT 02 | Negative |
| HI002 ASSIGNMENT 01 | HI003 ASSIGNMENT 01 | Negative |
| HI002 ASSIGNMENT 02 | HI003 ASSIGNMENT 02 | Negative |

1. Weak or Strong Correlation

| Variable 1 | Variable 2 | Correlation |
|---|---|---|
| HI001 FINAL EXAM | HI002 FINAL EXAM | Weak |
| HI001 ASSIGNMENT 01 | HI001 ASSIGNMENT 02 | Strong |
| HI002 ASSIGNMENT 01 | HI002 ASSIGNMENT 02 | Strong |
| HI003 ASSIGNMENT 01 | HI003 ASSIGNMENT 02 | Strong |
| HI001 FINAL EXAM | HI003 FINAL EXAM | Weak |
| HI002 FINAL EXAM | HI003 FINAL EXAM | Weak |
| HI001 ASSIGNMENT 01 | HI002 ASSIGNMENT 01 | Weak |
| HI001 ASSIGNMENT 02 | HI002 ASSIGNMENT 02 | Weak |
| HI002 ASSIGNMENT 01 | HI003 ASSIGNMENT 01 | Weak |
| HI002 ASSIGNMENT 02 | HI003 ASSIGNMENT 02 | Weak |
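The direction and strength labels above can be reproduced mechanically from the correlation values in part (A). The sketch below is illustrative only: the cut-off of 0.5 between weak and strong is an assumption that happens to match the labels in the tables, since the source does not state the threshold it used, and the function name is hypothetical.

```python
# Correlations from the table in part (A), written as proportions.
pairs = [
    ("HI001 FINAL EXAM", "HI002 FINAL EXAM", 0.05),
    ("HI001 ASSIGNMENT 01", "HI001 ASSIGNMENT 02", 0.66),
    ("HI002 ASSIGNMENT 01", "HI002 ASSIGNMENT 02", 0.55),
    ("HI003 ASSIGNMENT 01", "HI003 ASSIGNMENT 02", 0.52),
    ("HI001 FINAL EXAM", "HI003 FINAL EXAM", 0.12),
    ("HI002 FINAL EXAM", "HI003 FINAL EXAM", 0.12),
    ("HI001 ASSIGNMENT 01", "HI002 ASSIGNMENT 01", -0.13),
    ("HI001 ASSIGNMENT 02", "HI002 ASSIGNMENT 02", -0.04),
    ("HI002 ASSIGNMENT 01", "HI003 ASSIGNMENT 01", -0.23),
    ("HI002 ASSIGNMENT 02", "HI003 ASSIGNMENT 02", -0.11),
]

STRONG_THRESHOLD = 0.5  # assumed cut-off; not stated in the source


def classify(r, threshold=STRONG_THRESHOLD):
    """Return (direction, strength) labels for a correlation coefficient."""
    direction = "Positive" if r > 0 else "Negative" if r < 0 else "Zero"
    strength = "Strong" if abs(r) >= threshold else "Weak"
    return direction, strength


labels = [(v1, v2, *classify(r)) for v1, v2, r in pairs]

for v1, v2, direction, strength in labels:
    print(f"{v1} vs {v2}: {direction}, {strength}")
```

A different threshold would change the weak or strong labels but not the positive or negative ones, which depend only on the sign of the correlation.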

1. Significance Value

| Variable 1 | Variable 2 | Significance Value |
|---|---|---|
| HI001 FINAL EXAM | HI002 FINAL EXAM | Low |
| HI001 ASSIGNMENT 01 | HI001 ASSIGNMENT 02 | High |
| HI002 ASSIGNMENT 01 | HI002 ASSIGNMENT 02 | High |
| HI003 ASSIGNMENT 01 | HI003 ASSIGNMENT 02 | High |
| HI001 FINAL EXAM | HI003 FINAL EXAM | Low |
| HI002 FINAL EXAM | HI003 FINAL EXAM | Low |
| HI001 ASSIGNMENT 01 | HI002 ASSIGNMENT 01 | Low |
| HI001 ASSIGNMENT 02 | HI002 ASSIGNMENT 02 | Low |
| HI002 ASSIGNMENT 01 | HI003 ASSIGNMENT 01 | Low |
| HI002 ASSIGNMENT 02 | HI003 ASSIGNMENT 02 | Low |

1. The information above shows that the significance value plays a vital role in establishing the relationship between different variables. A high value helps in drawing conclusions and in making sound decisions about different activities (Cohen, et al., 2013).

4. In addition, the significance values show that the data contain both positive and negative relationships between the variables. A positive correlation with a high significance value suggests that students who did well in one assessment also tended to do well in the other. Moreover, whether the correlation among the variables is weak or strong determines the significance value (Guiso, et al., 2015). A strong correlation corresponds to a high significance value and indicates a consistent relationship between the marks in the two assessments.

1. Use descriptive statistics to summarize the data from the two studies. What are your preliminary observations about the depression scores?

Medical Study 1

| Descriptive Statistics | Florida | New York | North Carolina |
|---|---|---|---|
| Mean | 5.55 | 8 | 7.05 |
| Standard Error | 0.478347 | 0.492042 | 0.634428877 |
| Median | 6 | 8 | 7.5 |
| Mode | 7 | 8 | 8 |
| Standard Deviation | 2.139233 | 2.200478 | 2.837252192 |
| Sample Variance | 4.576316 | 4.842105 | 8.05 |
| Kurtosis | -1.06219 | 0.626432 | -0.904925496 |
| Skewness | -0.27356 | 0.625687 | -0.056188269 |
| Range | 7 | 9 | 9 |
| Minimum | 2 | 4 | 3 |
| Maximum | 9 | 13 | 12 |
| Sum | 111 | 160 | 141 |
| Count | 20 | 20 | 20 |

In medical study 1, people in New York have the highest mean depression score (8.00), compared with North Carolina (7.05) and Florida (5.55).

Medical Study 2

| Descriptive Statistics | Florida | New York | North Carolina |
|---|---|---|---|
| Mean | 14.5 | 15.25 | 13.95 |
| Standard Error | 0.708965146 | 0.923024206 | 0.65884668 |
| Median | 14.5 | 14.5 | 14 |
| Mode | 17 | 14 | 12 |
| Standard Deviation | 3.170588522 | 4.127889737 | 2.946451925 |
| Sample Variance | 10.05263158 | 17.03947368 | 8.681578947 |
| Kurtosis | -0.340799481 | -0.0301367 | -0.592052134 |
| Skewness | 0.280721497 | 0.525352494 | -0.041733773 |
| Range | 12 | 15 | 11 |
| Minimum | 9 | 9 | 8 |
| Maximum | 21 | 24 | 19 |
| Sum | 290 | 305 | 279 |
| Count | 20 | 20 | 20 |

In study 2, individuals with a chronic health condition such as arthritis, hypertension, and/or heart ailment have similar, high depression scores in all three locations (means of 14.50, 15.25 and 13.95). When the two studies are compared, individuals 65 years of age or older with chronic health conditions have higher depression scores than healthy individuals (Tang and Zhang, 2013).

1. Use analysis of variance on both data sets. State the hypotheses being tested in each case. What are your conclusions?

Medical Study 1:

Hypothesis Formulation:

H0: µ1=µ2=µ3

H0 indicates no significant difference in the mean depression score of healthy people in the three locations.

Ha: Not all of µ1, µ2 and µ3 are equal

Ha indicates a significant difference in the mean depression score of healthy people in at least one of the three locations.

Here,

µ1= the mean depression score of healthy people in Florida

µ2= the mean depression score of healthy people in New York

µ3= the mean depression score of healthy people in North Carolina

Rejection Rule: The null hypothesis is rejected if the calculated value of the F statistic ≥ the F critical value or, equivalently, if the p-value ≤ 0.05.

ANOVA Single Factor:

| Source of Variation | SS | df | MS | F | P-value | F crit |
|---|---|---|---|---|---|---|
| Between Groups | 61.03333 | 2 | 30.51666667 | 5.240886 | 0.00814 | 3.158842719 |
| Within Groups | 331.9 | 57 | 5.822807018 | | | |
| Total | 392.9333 | 59 | | | | |

Conclusion:

Here, F (5.24) is greater than Fcrit (3.16) and the p-value (0.008) is less than 0.05, so the null hypothesis is rejected. The sample provides enough evidence to support the claim that there is a significant difference in the mean depression score of healthy people in the three locations.

Medical Study 2:

ANOVA Single Factor:

| Source of Variation | SS | df | MS | F | P-value | F crit |
|---|---|---|---|---|---|---|
| Between Groups | 17.03333333 | 2 | 8.516667 | 0.714212 | 0.493906 | 3.158843 |
| Within Groups | 679.7 | 57 | 11.92456 | | | |
| Total | 696.7333333 | 59 | | | | |

Conclusion:

Here, F (0.714) is less than Fcrit (3.16) and the p-value (0.494) is greater than 0.05, so the null hypothesis is not rejected. The sample does not provide enough evidence to conclude that there is a significant difference in the mean depression score of individuals having a chronic health condition such as arthritis, hypertension, and/or heart ailment in the three locations (Shipley, 2016).
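With equal group sizes, the single-factor ANOVA quantities can be recovered from the group means and sample variances in the descriptive statistics tables: the between-groups sum of squares is n·Σ(x̄ᵢ − x̄)² and the within-groups sum of squares is (n − 1)·Σsᵢ². The sketch below recomputes F for both studies on that basis; the function name is illustrative, and the critical value is copied from the ANOVA output rather than computed.

```python
def one_way_anova(means, variances, n_per_group):
    """Return (SS between, SS within, F) for equally sized groups."""
    k = len(means)
    grand_mean = sum(means) / k  # valid because every group has the same size
    ss_between = n_per_group * sum((m - grand_mean) ** 2 for m in means)
    ss_within = (n_per_group - 1) * sum(variances)
    df_between = k - 1
    df_within = k * (n_per_group - 1)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat


F_CRIT = 3.158842719  # F critical value at alpha = 0.05, from the ANOVA output

# Group order: Florida, New York, North Carolina (20 observations each).
study1 = one_way_anova([5.55, 8.0, 7.05], [4.576316, 4.842105, 8.05], 20)
study2 = one_way_anova(
    [14.5, 15.25, 13.95], [10.05263158, 17.03947368, 8.681578947], 20
)

for label, (ssb, ssw, f_stat) in (("Study 1", study1), ("Study 2", study2)):
    decision = "reject H0" if f_stat >= F_CRIT else "do not reject H0"
    print(f"{label}: SSB = {ssb:.2f}, SSW = {ssw:.2f}, F = {f_stat:.3f} -> {decision}")
```

The recomputed values agree with the ANOVA tables (F of about 5.24 for Study 1 and about 0.71 for Study 2), which gives a quick consistency check between the descriptive statistics and the ANOVA output.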

1. Use inferences about individual treatment means where appropriate. What are your conclusions?

From the results, it can be inferred that in medical study 1 the mean depression score is related to geographical location, since at least one location's mean differs from the others. In addition, people from New York have a higher mean depression score (8.00) than individuals from the other two locations (7.05 in North Carolina and 5.55 in Florida).

In medical study 2, it can be inferred that the mean depression score of individuals 65 years of age or older having a chronic health condition is not linked to location. The mean depression scores are similar in all three geographical locations.
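One common way to examine individual treatment means after a significant ANOVA is Fisher's least significant difference (LSD) procedure, which the analysis above does not itself carry out. The sketch below applies it to Study 1 only, using the within-groups mean square from the ANOVA table. The critical t value of 2.0025 (two-tailed, α = 0.05, 57 degrees of freedom) is an assumption read from a standard t table, and the variable names are illustrative.

```python
import math
from itertools import combinations

# Study 1 group means from the descriptive statistics table.
means = {"Florida": 5.55, "New York": 8.0, "North Carolina": 7.05}
n = 20                # observations per location
mse = 5.822807018     # within-groups mean square from the Study 1 ANOVA
T_CRIT = 2.0025       # assumed t(0.025, 57) taken from a t table

# Least significant difference for two groups of equal size n.
lsd = T_CRIT * math.sqrt(mse * (1 / n + 1 / n))

significant = {}
for a, b in combinations(means, 2):
    difference = abs(means[a] - means[b])
    significant[(a, b)] = difference > lsd
    print(f"{a} vs {b}: |difference| = {difference:.2f}, LSD = {lsd:.2f}, "
          f"significant = {significant[(a, b)]}")
```

Under these assumptions only the Florida and New York means differ by more than the LSD (2.45 against roughly 1.53). The Florida and North Carolina difference (1.50) falls just short, so that comparison is sensitive to the exact critical value used.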

References

Cohen, J., Cohen, P., West, S. G. and Aiken, L. S. (2013) Applied multiple regression/correlation analysis for the behavioral sciences. UK: Routledge.

Guiso, L., Sapienza, P. and Zingales, L. (2015) The value of corporate culture. Journal of Financial Economics, 117(1), pp. 60-76.

Little, R. J., and Rubin, D. B. (2014) Statistical analysis with missing data. USA: John Wiley & Sons.

Shipley, B. (2016) Cause and Correlation in Biology: A User’s Guide to Path Analysis, Structural Equations and Causal Inference with R. Cambridge University Press.

Tang, Q. Y., and Zhang, C. X. (2013) Data Processing System (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research. Insect Science, 20(2), pp. 254-260.