Linear Statistical Models Regression Analysis Essay


Discuss linear statistical models and regression analysis.



The purpose of this study is to determine the impact of particular population characteristics on health development. It seeks to describe some major aspects of the population based on the amount of expenditure set aside for health. Health development is among the most significant worldwide concerns because it involves and affects everyone. It also significantly affects other areas such as population and the economy. For this reason, researchers, governments and business people have made an effort to understand the performance of the health sector in their own and other countries. Many studies of health development were carried out between 2001 and 2015, covering aspects of the health sector such as birth rate, fertility rate, health expenditure and death rate.

The World Bank recently released data on health development over the past 15 years, containing health and population statistics. This data provides a rough estimate of health development across various countries over those fifteen years. Governments need to know how much they have been investing in health and, from the data, to predict what they will have to invest in the future in order to keep the population in good balance. Businesses should also be able to know the percentage that people are willing to invest in their health and how they can create healthcare solutions.

Clustering is a data analysis technique that gathers a set of objects and separates them into sensible groups of similar objects, by determining how the objects are similar and how they differ. Jain (2010) explains that cluster analysis is the study of methods and algorithms for grouping objects on the basis of some perceived or measured similarity. It does not use category labels that tag objects with previously known identifiers.

K-means clustering

K-means clustering is a type of data clustering commonly used for unsupervised learning tasks (Ding & He 2004, p. 29). It takes a set of points in space and partitions them into k groups, where k is the number of groups chosen by the analyst. Each point is assigned to the group whose centre (mean) is nearest.

Since the sample is known to include 7 countries, we wrote code that groups the data into 7 clusters and then cross-tabulated the cluster labels against the countries:

table(mydataCluster$cluster, mydata$Country)

As the resulting table shows, the data belonging to ASM were grouped into cluster 5, AUS into cluster 1, FJI into cluster 4 and MHL into cluster 3. The algorithm wrongly classified three data points belonging to MNG and four data points belonging to WSM.
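The clustering and cross-tabulation step can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `toy`, its columns `x1` and `x2`, and the three groups A, B and C are made up for the example and are not the World Bank data or the original `mydata` object.

```r
# Synthetic illustration only: these numbers are invented, not the study data.
set.seed(42)

# Three hypothetical groups standing in for countries, well separated on two
# hypothetical indicators (x1, x2).
countries <- rep(c("A", "B", "C"), each = 15)
centres <- rep(c(0, 10, 20), each = 15)
toy <- data.frame(
  Country = countries,
  x1 = centres + rnorm(45, sd = 0.5),
  x2 = centres + rnorm(45, sd = 0.5)
)

# k is set to the number of known groups. nstart repeats the random
# initialisation and keeps the solution with the lowest within-cluster
# sum of squares.
fit <- kmeans(toy[, c("x1", "x2")], centers = 3, nstart = 25)

# Cross-tabulate cluster labels (rows) against the known grouping (columns).
# Cluster numbers are arbitrary, so what matters is whether each country
# falls mostly into a single row.
confusion <- table(fit$cluster, toy$Country)
print(confusion)

# Count the points that lie outside the dominant cluster of their country.
misclassified <- sum(confusion) - sum(apply(confusion, 2, max))
print(misclassified)
```

Counting the points outside each column's largest cell is one simple way to quantify how many observations were "wrongly classified" in the sense used above. It assumes that each country should map to exactly one cluster.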

Regression analysis describes the relationship between two variables in a simplified mathematical model (Schneider, Hommel & Blettner 2010, p. 776). It predicts one variable from one or more other variables.

In this case, a simple linear regression is fitted with death rate as the dependent variable (Y) and health expenditure as the independent variable (X). Death rate is predicted from health expenditure using a linear model of the form:

Y = β0 + β1X + ε

where β0 is the intercept, β1 is the slope and ε is the random error term.

The simple linear regression is performed on the two variables at the 95% confidence level. A second linear regression is then performed with year as the independent variable and birth rate as the dependent variable, so that year is used to predict birth rate using a linear equation of the same kind.
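The first model, death rate regressed on health expenditure, could be fitted along the following lines. This is a minimal sketch, not the original analysis code: the data frame `toy`, the column names `health_expenditure` and `death_rate`, and every number in it are assumptions invented for illustration.

```r
# Synthetic illustration only: these numbers are invented, not the study data.
set.seed(1)

toy <- data.frame(health_expenditure = runif(60, min = 2, max = 12))
toy$death_rate <- 9 - 0.2 * toy$health_expenditure + rnorm(60, sd = 1.5)

# Fit Y = b0 + b1 * X by ordinary least squares.
model <- lm(death_rate ~ health_expenditure, data = toy)

# Estimated intercept (b0) and slope (b1).
print(coef(model))

# 95% confidence intervals for both coefficients.
ci <- confint(model, level = 0.95)
print(ci)

# Predicted death rate at a hypothetical expenditure value of 8.
prediction <- predict(model, newdata = data.frame(health_expenditure = 8))
print(prediction)
```

The second model would follow the same pattern with different columns in the formula, for example `lm(birth_rate ~ year, data = ...)`, where the column names are again hypothetical.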


The first regression model has a p-value below 0.05, which means that the model is statistically significant and health expenditure can therefore be used to predict death rate. However, the model has a low coefficient of determination of 0.03429, meaning that the predictor explains only about 3.4% of the variation in death rate. For the second regression model, the p-value likewise leads to the conclusion that the model is significant and can predict birth rate based on health expenditure. Its coefficient of determination is slightly lower than that of the first model: the R-squared value of 0.0209 implies that only 2.09% of the variation in birth rate is explained by health expenditure. Further studies should therefore be carried out to identify the other variables affecting these health development factors.
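The two quantities discussed here, the p-value and the coefficient of determination, can both be read from the summary of a fitted model in R. The sketch below uses invented data with a weak signal and a lot of noise. The variable names `x` and `y` are placeholders, and the numbers it produces are unrelated to the results reported above.

```r
# Synthetic illustration only: these numbers are invented, not the study data.
set.seed(2)

x <- runif(200)
y <- 1 + 0.5 * x + rnorm(200, sd = 1)  # weak relationship, large noise

model <- lm(y ~ x)
s <- summary(model)

# Coefficient of determination: share of the variance in y explained by x.
r_squared <- s$r.squared

# p-value of the t-test on the slope. In simple linear regression this equals
# the p-value of the overall F-test.
p_value <- s$coefficients["x", "Pr(>|t|)"]

# Significance at the 5% level is a separate question from how much
# variance the model explains.
significant <- p_value < 0.05

print(c(r_squared = r_squared, p_value = p_value))
print(significant)
```

Significance and explanatory power measure different things. With enough observations a slope can be distinguishable from zero while the model still leaves most of the variation unexplained.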

A major difficulty in the analysis was that the data set was in wide format and had to be converted to long format to ease analysis. Another challenge was that the attributes were in rows instead of columns, so in that case I had to convert the data to wide format. Finally, the data set had many missing values, which made the analysis difficult and prompted the use of the na.rm argument in R for most computations. From these challenges I learnt how to convert a data set from one format to another, for example wide to long or vice versa, and how to use functions that omit missing values (NA) from an analysis. In future I will ensure that the data is well structured and ready before starting the analysis.
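The wide-to-long conversion and the handling of missing values can be sketched with base R alone. The table below is invented for illustration: the names `wide`, `long`, `Country`, `Year` and `value`, and all the numbers, are assumptions and not the original data set.

```r
# Made-up wide table: one row per country, one column per year.
wide <- data.frame(
  Country = c("A", "B"),
  X2001 = c(10.0, NA),
  X2002 = c(11.0, 7.5),
  X2003 = c(NA, 8.0)
)

# Convert to long format: one row per country-year pair.
long <- reshape(
  wide,
  direction = "long",
  varying = c("X2001", "X2002", "X2003"),
  v.names = "value",
  timevar = "Year",
  times = c(2001, 2002, 2003),
  idvar = "Country"
)
rownames(long) <- NULL
print(long)

# Missing values propagate through summary functions by default.
mean_with_na <- mean(long$value)

# na.rm = TRUE drops the NA entries before computing the mean.
mean_without_na <- mean(long$value, na.rm = TRUE)

# na.omit() removes every row that contains a missing value.
complete <- na.omit(long)
```

Calling `reshape()` with `direction = "wide"` performs the reverse conversion. Note that `na.rm` is an argument of functions such as `mean()` and `sum()`, whereas `na.omit()` acts on the whole data frame.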


Schneider, A., Hommel, G. and Blettner, M., 2010. Linear regression analysis. Deutsches Ärzteblatt International, 107(44), pp.776-782.

Neter, J., Kutner, M.H., Nachtsheim, C.J. and Wasserman, W., 1996. Applied linear statistical models (Vol. 4, p. 318). Chicago: Irwin.

Jain, A.K., 2010. Data clustering: 50 years beyond K-means. Pattern recognition letters, 31(8), pp.651-666.

Ding, C. and He, X., 2004, July. K-means clustering via principal component analysis. In Proceedings of the twenty-first international conference on Machine learning (p. 29). ACM.
