HI6007 Statistics Assignment

HI6007 Statistics Assignment for Business Decisions

 

Task 1

  1. Use methods of descriptive statistics to summarize the data. Comment on the findings.

The below table presents the descriptive statistics for all variables including income, household size and amount charged:

Descriptive Statistics | Income ($1000s) | Household Size | Amount Charged ($)
Mean | 43.48 | 3.42 | 3963.86
Standard Error | 2.057785614 | 0.245930138 | 132.023387
Median | 42 | 3 | 4090
Mode | 54 | 2 | 3890
Standard Deviation | 14.55074162 | 1.738988681 | 933.5463219
Sample Variance | 211.7240816 | 3.024081633 | 871508.7351
Kurtosis | -1.247719422 | -0.722808552 | -0.742482171
Skewness | 0.095855639 | 0.527895977 | -0.128860064
Range | 46 | 6 | 3814
Minimum | 21 | 1 | 1864
Maximum | 67 | 7 | 5678
Sum | 2174 | 171 | 198193
Count | 50 | 50 | 50

The descriptive statistics summarize a sample of 50 consumers collected by Consumer Research, Inc. to identify the consumer characteristics that predict the amount charged by credit card users. Average annual income is approximately $43,480, average household size is 3.42 (between 3 and 4 people), and the average amount charged is about $3,964.

Moreover, the amount charged is widely dispersed: its standard deviation is about $934, a coefficient of variation of roughly 24% of the mean. The maximum amount charged is $5,678 and the minimum is $1,864.
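The coefficient of variation and the standard error can be recomputed from the table as a quick check. The sketch below assumes the means and standard deviations are copied from the descriptive statistics table; the dictionary layout and function names are illustrative and not part of the original analysis.

```python
import math

# Means and standard deviations copied from the descriptive statistics table.
# The dictionary layout and names are illustrative assumptions.
summary = {
    "Income ($1000s)": {"mean": 43.48, "sd": 14.55074162},
    "Household Size": {"mean": 3.42, "sd": 1.738988681},
    "Amount Charged ($)": {"mean": 3963.86, "sd": 933.5463219},
}
n = 50  # sample size (Count row of the table)


def coefficient_of_variation(mean, sd):
    """Standard deviation expressed as a percentage of the mean."""
    return 100.0 * sd / mean


def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


for name, s in summary.items():
    cv = coefficient_of_variation(s["mean"], s["sd"])
    se = standard_error(s["sd"], n)
    print(f"{name}: CV = {cv:.1f}%, SE = {se:.4f}")
```

The standard errors printed this way should agree with the Standard Error row of the table, which is a useful check that the flattened figures were read correctly.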

A negative kurtosis value indicates a distribution that is flatter than the normal distribution, with lighter tails, so the data are spread more evenly around the mean rather than concentrated near it (Tang and Zhang, 2013). A kurtosis value within ±1 is generally considered most suitable for empirical use. All three kurtosis values are negative, and only income (-1.25) falls slightly outside this range. Skewness measures the asymmetry of the values around the mean, and a skewness value within ±1 is considered acceptable.

From the table, the skewness values for all three variables lie within ±1, so all the variables meet the acceptability level for empirical use.

The below table shows correlation between the variables:

Correlation | Income | Household size | Amount charged
Income | 1
Household size | 0.172533 | 1
Amount charged | 0.630781 | 0.752853835 | 1

Amount charged is significantly correlated with both income (r = 0.63) and household size (r = 0.75), and the correlation with household size is the stronger of the two. Income and household size are only weakly correlated with each other (r = 0.17).
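Whether a sample correlation is significant can be checked with the usual t statistic, t = r * sqrt(n - 2) / sqrt(1 - r^2), on n - 2 degrees of freedom. The sketch below applies it to the three correlations in the table. The critical value 2.011 is an approximate two-tailed 5% value for 48 degrees of freedom read from a t table; it is an assumption and does not come from the report.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: population correlation = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


n = 50
# Approximate two-tailed 5% critical t value for 48 degrees of freedom.
# Assumption: taken from a standard t table, not from the report.
T_CRITICAL = 2.011

correlations = {
    ("Income", "Household size"): 0.172533,
    ("Income", "Amount charged"): 0.630781,
    ("Household size", "Amount charged"): 0.752853835,
}

for (a, b), r in correlations.items():
    t = correlation_t_statistic(r, n)
    verdict = "significant" if abs(t) > T_CRITICAL else "not significant"
    print(f"{a} vs {b}: r = {r:.3f}, t = {t:.2f} ({verdict} at the 5% level)")
```

For a single predictor this t statistic is the same as the slope's t Stat in a simple linear regression of one variable on the other.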

  1. Develop estimated regression equations, first using annual income as the independent variable and then using household size as the independent variable. Which variable is the better predictor of annual credit card charges? Discuss your findings.

Regression analysis for annual income and credit card charges:

SUMMARY OUTPUT

Regression Statistics
Multiple R | 0.630780826
R Square | 0.39788445
Adjusted R Square | 0.385340376
Standard Error | 731.902474
Observations | 50

ANOVA
Source | df | SS | MS | F | Significance F
Regression | 1 | 16991228.91 | 16991228.91 | 31.71891773 | 9.10311E-07
Residual | 48 | 25712699.11 | 535681.2315
Total | 49 | 42703928.02

Term | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0%
Intercept | 2204.240517 | 329.1340306 | 6.697090887 | 2.14344E-08 | 1542.472207 | 2866.009 | 1542.472 | 2866.009
X Variable 1 (Income) | 40.46962932 | 7.185715961 | 5.631955054 | 9.10311E-07 | 26.02177931 | 54.91748 | 26.02178 | 54.91748

Regression equation:

Y= m X+ c

Y= 40.47X + 2204

Here,

X = Annual income

Y = Annual credit card charges

Regression analysis for household size and credit card charges:

SUMMARY OUTPUT

Regression Statistics
Multiple R | 0.752853835
R Square | 0.566788897
Adjusted R Square | 0.557763666
Standard Error | 620.8162594
Observations | 50

ANOVA
Source | df | SS | MS | F | Significance F
Regression | 1 | 24204112.28 | 24204112.28 | 62.80048437 | 2.86236E-10
Residual | 48 | 18499815.74 | 385412.8279
Total | 49 | 42703928.02

Term | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0%
Intercept | 2581.644082 | 195.269886 | 13.22090228 | 1.287E-17 | 2189.027669 | 2974.26 | 2189.028 | 2974.26
X Variable 1 (Household Size) | 404.1567013 | 50.99977822 | 7.924675664 | 2.86236E-10 | 301.6147764 | 506.6986 | 301.6148 | 506.6986

Regression equation:

Y= m X+ c

Y= 404.16X + 2581.64

Here,

X = Household Size

Y = Annual credit card charges

From the above regression analysis, with income as the independent variable the p-value is 9.10E-07 and R2 is approximately 0.40.

This means that about 40% of the variation in amount charged can be explained by annual income. With household size as the independent variable, the p-value is 2.86E-10 and R2 is approximately 0.57, so household size explains about 57% of the variation in amount charged. Both p-values are far below 0.05, so both relationships are statistically significant.

Household size is therefore the better predictor of annual credit card charges, because it explains about 57% of the variation in amount charged against only about 40% for annual income.
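R Square in each output is the regression sum of squares divided by the total sum of squares, so the comparison can be reproduced directly from the two ANOVA tables. A minimal sketch, with illustrative names that are not part of the original analysis:

```python
# Sums of squares copied from the two simple-regression ANOVA tables.
SS_TOTAL = 42703928.02
ss_regression = {
    "Annual income": 16991228.91,
    "Household size": 24204112.28,
}


def r_squared(ss_reg, ss_total):
    """Share of the total variation explained by the regression."""
    return ss_reg / ss_total


scores = {name: r_squared(ssr, SS_TOTAL) for name, ssr in ss_regression.items()}
best = max(scores, key=scores.get)

for name, r2 in scores.items():
    print(f"{name}: R Square = {r2:.4f}")
print(f"Better single predictor: {best}")
```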

  1. Develop an estimated regression equation with annual income and household size as the independent variables. Discuss your findings.

SUMMARY OUTPUT

Regression Statistics
Multiple R | 0.908501824
R Square | 0.825375565
Adjusted R Square | 0.817944738
Standard Error | 398.3249315
Observations | 50

ANOVA
Source | df | SS | MS | F | Significance F
Regression | 2 | 35246778.72 | 17623389.36 | 111.0745228 | 1.54692E-18
Residual | 47 | 7457149.298 | 158662.751
Total | 49 | 42703928.02

Term | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0%
Intercept | 1305.033885 | 197.770988 | 6.598712469 | 3.32392E-08 | 907.1699825 | 1702.898 | 907.17 | 1702.898
X Variable 1 (Income) | 33.12195539 | 3.970237444 | 8.342562845 | 7.88598E-11 | 25.13486801 | 41.10904 | 25.13487 | 41.10904
X Variable 2 (Household Size) | 356.3402032 | 33.22039979 | 10.72654771 | 3.17247E-14 | 289.5093801 | 423.171 | 289.5094 | 423.171

Y=m1 X1 + m2X2 +C

Y= 33.12 X1 + 356.34 X2 + 1305.03

Here,

Y= Amount Charged

X1 = Income

X2 = Household Size

Based on the above regression analysis, R2 is 0.8254, which means that about 82.54% of the variation in amount charged can be explained by household size and income together.

Both coefficients are highly significant in the combined model (p-values of 7.89E-11 for income and 3.17E-14 for household size), and the overall Significance F (1.55E-18) is smaller than in either simple regression. The standard error of the estimate falls to about 398, compared with about 732 and 621 in the two simple linear regression models, which shows the improvement in the regression model.
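Adjusted R Square and the standard error of the estimate follow from R Square, the residual sum of squares, the sample size n and the number of predictors k, so the three models can be compared on equal terms. The sketch below recomputes both from the figures in the regression outputs; the function names and the list layout are illustrative assumptions.

```python
import math


def adjusted_r_squared(r2, n, k):
    """Adjusted R Square: penalises R Square for the number of predictors k."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def standard_error_of_estimate(ss_residual, n, k):
    """Square root of the residual mean square."""
    return math.sqrt(ss_residual / (n - k - 1))


n = 50
# (label, R Square, residual SS, number of predictors) from the three outputs.
models = [
    ("Income only", 0.39788445, 25712699.11, 1),
    ("Household size only", 0.566788897, 18499815.74, 1),
    ("Income + household size", 0.825375565, 7457149.298, 2),
]

for label, r2, ss_res, k in models:
    adj = adjusted_r_squared(r2, n, k)
    se = standard_error_of_estimate(ss_res, n, k)
    print(f"{label}: adjusted R Square = {adj:.4f}, standard error = {se:.1f}")
```

Because adjusted R Square accounts for the extra predictor, its rise in the two-variable model suggests the improvement is not just a result of adding a variable.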

  1. What is the predicted annual credit card charge for a three-person household with an annual income of $40,000?

The prediction uses the estimated regression equation with both independent variables, with income expressed in thousands of dollars (X1 = 40) and household size X2 = 3:

Y= 33.12 X1 + 356.34 X2 + 1305.03

Y= 33.12 *40 + 356.34 *3 + 1305.03

= 1324.8 + 1069.02 + 1305.03

= $3698.85

The predicted annual credit card charge for a three-person household with an annual income of $40,000 is about $3,699.
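The same calculation can be wrapped in a small function so that other households can be evaluated. This is a sketch using the rounded coefficients from the equation above; the function and constant names are hypothetical, and income is converted to thousands of dollars because that is the unit used in the data.

```python
# Rounded coefficients from the estimated multiple regression equation.
INTERCEPT = 1305.03
B_INCOME = 33.12       # change in charge per $1000 of annual income
B_HOUSEHOLD = 356.34   # change in charge per additional household member


def predict_charge(income_dollars, household_size):
    """Predicted annual credit card charge in dollars."""
    income_thousands = income_dollars / 1000.0
    return INTERCEPT + B_INCOME * income_thousands + B_HOUSEHOLD * household_size


predicted = predict_charge(40_000, 3)
print(f"Predicted annual charge: ${predicted:,.2f}")
```

Predictions are most reliable inside the range of the sample, that is, incomes between $21,000 and $67,000 and household sizes from 1 to 7.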

  1. Discuss the need for other independent variables that could be added to the model. What additional variables might be helpful?

The standard error of the estimate in the regression model is still large (about $398), so the following independent variables could be added to the model:

Number of credit cards: The number of credit cards a consumer holds provides measurable information about the consumer. Consumers with more credit cards can be expected to charge higher amounts. However, the number of cards may be only weakly correlated with income and household size (Little and Rubin, 2014).

Purchasing options: This variable captures consumers’ preference for paying by cash or by credit card. It can reveal the buying patterns of consumers, which are shaped by security concerns and cultural factors.

Average age and gender ratio of a household: This demographic information can help to improve the existing estimation model. It is commonly believed that young people and women make more purchases than men, including by credit card, so this variable may help to refine the model and give more accurate results. In addition, both measures are easy to collect and are independent of the other variables.

 Task 2

Activity 1:

Activity 2:

(A)

(B)

Activity 3:

(A)

Correlations between ten different pairs of variables were calculated to evaluate students' marks across different assessments. The pairs and their correlations are shown below:

Correlation

Variable 1 | Variable 2 | Correlation
HI001 FINAL EXAM | HI002 FINAL EXAM | 5%
HI001 ASSIGNMENT 01 | HI001 ASSIGNMENT 02 | 66%
HI002 ASSIGNMENT 01 | HI002 ASSIGNMENT 02 | 55%
HI003 ASSIGNMENT 01 | HI003 ASSIGNMENT 02 | 52%
HI001 FINAL EXAM | HI003 FINAL EXAM | 12%
HI002 FINAL EXAM | HI003 FINAL EXAM | 12%
HI001 ASSIGNMENT 01 | HI002 ASSIGNMENT 01 | -13%
HI001 ASSIGNMENT 02 | HI002 ASSIGNMENT 02 | -4%
HI002 ASSIGNMENT 01 | HI003 ASSIGNMENT 01 | -23%
HI002 ASSIGNMENT 02 | HI003 ASSIGNMENT 02 | -11%

 (B)

  1. Positive or Negative Correlated
Variable 1 | Variable 2 | Correlated
HI001 FINAL EXAM | HI002 FINAL EXAM | Positive
HI001 ASSIGNMENT 01 | HI001 ASSIGNMENT 02 | Positive
HI002 ASSIGNMENT 01 | HI002 ASSIGNMENT 02 | Positive
HI003 ASSIGNMENT 01 | HI003 ASSIGNMENT 02 | Positive
HI001 FINAL EXAM | HI003 FINAL EXAM | Positive
HI002 FINAL EXAM | HI003 FINAL EXAM | Positive
HI001 ASSIGNMENT 01 | HI002 ASSIGNMENT 01 | Negative
HI001 ASSIGNMENT 02 | HI002 ASSIGNMENT 02 | Negative
HI002 ASSIGNMENT 01 | HI003 ASSIGNMENT 01 | Negative
HI002 ASSIGNMENT 02 | HI003 ASSIGNMENT 02 | Negative

 

  1. Weak or Strong Correlation
Variable 1 | Variable 2 | Correlation
HI001 FINAL EXAM | HI002 FINAL EXAM | Weak
HI001 ASSIGNMENT 01 | HI001 ASSIGNMENT 02 | Strong
HI002 ASSIGNMENT 01 | HI002 ASSIGNMENT 02 | Strong
HI003 ASSIGNMENT 01 | HI003 ASSIGNMENT 02 | Strong
HI001 FINAL EXAM | HI003 FINAL EXAM | Weak
HI002 FINAL EXAM | HI003 FINAL EXAM | Weak
HI001 ASSIGNMENT 01 | HI002 ASSIGNMENT 01 | Weak
HI001 ASSIGNMENT 02 | HI002 ASSIGNMENT 02 | Weak
HI002 ASSIGNMENT 01 | HI003 ASSIGNMENT 01 | Weak
HI002 ASSIGNMENT 02 | HI003 ASSIGNMENT 02 | Weak
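
The sign and strength labels can be derived mechanically from the correlation values. The sketch below is one way to do it; the report does not state the cut-off it used, so a threshold of 0.5 in absolute value is assumed here because it reproduces the labels in the tables. The function and variable names are illustrative.

```python
# Correlations from the Activity 3 (A) table, written as fractions.
pairs = [
    ("HI001 FINAL EXAM", "HI002 FINAL EXAM", 0.05),
    ("HI001 ASSIGNMENT 01", "HI001 ASSIGNMENT 02", 0.66),
    ("HI002 ASSIGNMENT 01", "HI002 ASSIGNMENT 02", 0.55),
    ("HI003 ASSIGNMENT 01", "HI003 ASSIGNMENT 02", 0.52),
    ("HI001 FINAL EXAM", "HI003 FINAL EXAM", 0.12),
    ("HI002 FINAL EXAM", "HI003 FINAL EXAM", 0.12),
    ("HI001 ASSIGNMENT 01", "HI002 ASSIGNMENT 01", -0.13),
    ("HI001 ASSIGNMENT 02", "HI002 ASSIGNMENT 02", -0.04),
    ("HI002 ASSIGNMENT 01", "HI003 ASSIGNMENT 01", -0.23),
    ("HI002 ASSIGNMENT 02", "HI003 ASSIGNMENT 02", -0.11),
]

# Assumed cut-off between "Weak" and "Strong"; not stated in the report.
STRONG_THRESHOLD = 0.5


def classify(r, threshold=STRONG_THRESHOLD):
    """Return (direction, strength) labels for a correlation coefficient."""
    if r > 0:
        direction = "Positive"
    elif r < 0:
        direction = "Negative"
    else:
        direction = "None"
    strength = "Strong" if abs(r) >= threshold else "Weak"
    return direction, strength


labels = [(v1, v2) + classify(r) for v1, v2, r in pairs]
for v1, v2, direction, strength in labels:
    print(f"{v1} vs {v2}: {direction}, {strength}")
```

A different threshold would change which pairs count as strong, so the cut-off should be stated alongside the labels.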

 

  1. Significance Value
Variable 1 | Variable 2 | Significance Value
HI001 FINAL EXAM | HI002 FINAL EXAM | Low
HI001 ASSIGNMENT 01 | HI001 ASSIGNMENT 02 | High
HI002 ASSIGNMENT 01 | HI002 ASSIGNMENT 02 | High
HI003 ASSIGNMENT 01 | HI003 ASSIGNMENT 02 | High
HI001 FINAL EXAM | HI003 FINAL EXAM | Low
HI002 FINAL EXAM | HI003 FINAL EXAM | Low
HI001 ASSIGNMENT 01 | HI002 ASSIGNMENT 01 | Low
HI001 ASSIGNMENT 02 | HI002 ASSIGNMENT 02 | Low
HI002 ASSIGNMENT 01 | HI003 ASSIGNMENT 01 | Low
HI002 ASSIGNMENT 02 | HI003 ASSIGNMENT 02 | Low

 

  1. The information above shows that the significance value plays a vital role in establishing the relationship between different variables. A high value supports a firm conclusion and helps in making sound decisions for different activities (Cohen, et al., 2013).

In addition, the data show both positive and negative relationships between the variables. A positive correlation with a high significance value indicates that students who scored well on one assessment also tended to score well on the other. Moreover, the strength of the correlation, weak or strong, determines the significance value (Guiso, et al., 2015). A strong correlation therefore corresponds to a high significance value: for these pairs, marks on one assessment are a useful guide to marks on the other.

Task 3

  1. Use descriptive statistics to summarize the data from the two studies. What are your preliminary observations about the depression scores?

Medical Study 1

Descriptive Statistics | Florida | New York | North Carolina
Mean | 5.55 | 8 | 7.05
Standard Error | 0.478347 | 0.492042 | 0.634428877
Median | 6 | 8 | 7.5
Mode | 7 | 8 | 8
Standard Deviation | 2.139233 | 2.200478 | 2.837252192
Sample Variance | 4.576316 | 4.842105 | 8.05
Kurtosis | -1.06219 | 0.626432 | -0.904925496
Skewness | -0.27356 | 0.625687 | -0.056188269
Range | 7 | 9 | 9
Minimum | 2 | 4 | 3
Maximum | 9 | 13 | 12
Sum | 111 | 160 | 141
Count | 20 | 20 | 20

In Medical Study 1, people in New York have a higher mean depression score (8.00) than people in Florida (5.55) and North Carolina (7.05).

Medical Study 2

Descriptive Statistics | Florida | New York | North Carolina
Mean | 14.5 | 15.25 | 13.95
Standard Error | 0.708965146 | 0.923024206 | 0.65884668
Median | 14.5 | 14.5 | 14
Mode | 17 | 14 | 12
Standard Deviation | 3.170588522 | 4.127889737 | 2.946451925
Sample Variance | 10.05263158 | 17.03947368 | 8.681578947
Kurtosis | -0.340799481 | -0.0301367 | -0.592052134
Skewness | 0.280721497 | 0.525352494 | -0.041733773
Range | 12 | 15 | 11
Minimum | 9 | 9 | 8
Maximum | 21 | 24 | 19
Sum | 290 | 305 | 279
Count | 20 | 20 | 20

In Medical Study 2, individuals with a chronic health condition such as arthritis, hypertension, and/or heart ailment have similarly high depression scores in all three locations (means between 13.95 and 15.25). When the two studies are compared, individuals 65 years of age or older with chronic health conditions have higher depression scores than healthy individuals (Tang and Zhang, 2013).

  1. Use analysis of variance on both data sets. State the hypotheses being tested in each case. What are your conclusions?

Medical Study 1:

Hypothesis Formulation:

H0: µ1=µ2=µ3

H0 indicates no significant difference in the mean depression score of healthy people in the three locations.

Ha: Not all of µ1, µ2 and µ3 are equal

Ha indicates a significant difference in the mean depression score of healthy people in at least one of the three locations.

Here,

µ1= the mean depression score of healthy people in Florida

µ2= the mean depression score of healthy people in New York

µ3= the mean depression score of healthy people in North Carolina

Rejection Rule: The null hypothesis is rejected if the calculated value of the F statistic ≥ the F critical value (equivalently, if the p-value ≤ 0.05).

ANOVA Single Factor:

ANOVA
Source of Variation | SS | df | MS | F | P-value | F crit
Between Groups | 61.03333 | 2 | 30.51666667 | 5.240886 | 0.00814 | 3.158842719
Within Groups | 331.9 | 57 | 5.822807018
Total | 392.9333 | 59

Conclusion:

Here, F (5.24) is greater than Fcrit (3.16) and the p-value (0.008) is below 0.05, so the null hypothesis is rejected. The sample provides enough evidence to support the claim that there is a significant difference in the mean depression score of healthy people in the three locations.
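
Because the three groups are the same size, the ANOVA table for Medical Study 1 can be rebuilt from the group means and sample variances in the descriptive statistics. The sketch below does this as a cross-check; the function name is hypothetical and the approach assumes equal group sizes.

```python
def one_way_anova_from_summary(means, variances, n_per_group):
    """Single-factor ANOVA from group means and sample variances.

    Assumes every group has the same number of observations.
    """
    k = len(means)
    grand_mean = sum(means) / k  # valid because group sizes are equal
    ss_between = n_per_group * sum((m - grand_mean) ** 2 for m in means)
    ss_within = (n_per_group - 1) * sum(variances)
    df_between = k - 1
    df_within = k * (n_per_group - 1)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


# Medical Study 1: Florida, New York, North Carolina (20 people each).
means = [5.55, 8.0, 7.05]
variances = [4.576316, 4.842105, 8.05]

ss_between, ss_within, df_between, df_within, f_stat = one_way_anova_from_summary(
    means, variances, n_per_group=20
)
print(f"SS between = {ss_between:.4f} (df = {df_between})")
print(f"SS within  = {ss_within:.4f} (df = {df_within})")
print(f"F = {f_stat:.4f}")
```

The recomputed sums of squares and F statistic should match the ANOVA table to rounding error.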

Medical Study 2:

ANOVA Single Factor:

ANOVA
Source of Variation | SS | df | MS | F | P-value | F crit
Between Groups | 17.03333333 | 2 | 8.516667 | 0.714212 | 0.493906 | 3.158843
Within Groups | 679.7 | 57 | 11.92456
Total | 696.7333333 | 59

 

Conclusion:

Here, F (0.714) is less than Fcrit (3.16) and the p-value (0.494) is above 0.05, so the null hypothesis is not rejected. The sample does not provide enough evidence to conclude that the mean depression score of individuals having a chronic health condition such as arthritis, hypertension, and/or heart ailment differs across the three locations (Shipley, 2016).
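
With two numerator degrees of freedom (three groups), the upper-tail probability of the F distribution has a simple closed form, P(F > f) = (1 + 2f / df_within)^(-df_within / 2), so the p-values in both ANOVA tables can be checked without statistical tables. The sketch below applies it to both studies. The shortcut holds only when the between-groups degrees of freedom equal 2, and the function name is illustrative.

```python
def f_p_value_two_numerator_df(f_stat, df_within):
    """Upper-tail probability of an F(2, df_within) random variable.

    Closed form valid only when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f_stat / df_within) ** (-df_within / 2.0)


ALPHA = 0.05
DF_WITHIN = 57

# F statistics from the two ANOVA tables.
f_statistics = {
    "Medical Study 1": 5.240886,
    "Medical Study 2": 0.714212,
}

p_values = {
    study: f_p_value_two_numerator_df(f, DF_WITHIN)
    for study, f in f_statistics.items()
}

for study, p in p_values.items():
    decision = "reject H0" if p <= ALPHA else "do not reject H0"
    print(f"{study}: p-value = {p:.6f} -> {decision}")
```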

 

 

  1. Use inferences about individual treatment means where appropriate. What are your conclusions?

From the results for Medical Study 1, it can be inferred that the mean depression score is related to geographical location, since the ANOVA shows that the location means are not all equal. New York has the highest mean depression score (8.00), followed by North Carolina (7.05) and Florida (5.55).

In Medical Study 2, it can be inferred that the mean depression score of individuals 65 years of age or older having a chronic health condition is not linked to location. The mean depression scores are similar in all three geographical locations.
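
One common way to make inferences about individual treatment means after a significant ANOVA is Fisher's least significant difference (LSD) procedure, which flags a pair of means as different when their gap exceeds LSD = t * sqrt(MSE * (1/n1 + 1/n2)). The sketch below applies it to Medical Study 1 only, since the ANOVA for Medical Study 2 was not significant. The report itself does not present this calculation, and the critical value 2.0025 is an approximate two-tailed 5% t value for 57 degrees of freedom read from a t table, so both should be treated as assumptions.

```python
import math
from itertools import combinations

# Medical Study 1 group means and within-groups mean square from the tables.
means = {"Florida": 5.55, "New York": 8.0, "North Carolina": 7.05}
MSE = 5.822807018
N_PER_GROUP = 20

# Approximate two-tailed 5% critical t value for 57 degrees of freedom.
# Assumption: read from a standard t table, not taken from the report.
T_CRITICAL = 2.0025

# Least significant difference for two groups of equal size.
lsd = T_CRITICAL * math.sqrt(MSE * (2.0 / N_PER_GROUP))

significant_pairs = []
for a, b in combinations(means, 2):
    gap = abs(means[a] - means[b])
    if gap > lsd:
        significant_pairs.append((a, b))
    print(f"{a} vs {b}: difference = {gap:.2f}, LSD = {lsd:.3f}")

print("Pairs that differ at the 5% level:", significant_pairs)
```

Under these assumptions only some pairs exceed the LSD, so a significant ANOVA does not by itself imply that every location differs from every other.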

References

Cohen, J., Cohen, P., West, S. G. and Aiken, L. S. (2013) Applied multiple regression/correlation analysis for the behavioral sciences. UK: Routledge.

Guiso, L., Sapienza, P. and Zingales, L. (2015) The value of corporate culture. Journal of Financial Economics, 117(1), pp. 60-76.

Little, R. J., and Rubin, D. B. (2014) Statistical analysis with missing data. USA: John Wiley & Sons.

Shipley, B. (2016) Cause and Correlation in Biology: A User’s Guide to Path Analysis, Structural Equations and Causal Inference with R. Cambridge University Press.

Tang, Q. Y., and Zhang, C. X. (2013) Data Processing System (DPS) software with experimental design, statistical analysis and data mining developed for use in entomological research. Insect Science, 20(2), pp. 254-260.
