Monday, 28 December 2015

Creating graphs for data

I have selected the Add Health data set to examine individual variables with univariate graphs and the association between pairs of variables with bivariate graphs.

Below is the SAS code for both kinds of graphs:



LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;
DATA new; set mydata.addhealth_pds;
LABEL H1GI1Y="Year of birth"
   H1GH60="Weight of population"    
   H1GH30D="Diet pills consumption in last 7 days"
   H1GH30E="Laxatives consumption in last 7 days"
   H1GH31B="Exercised in last 7 days";
   
/* Grouping the ages into 5 groups */
IF H1GI1Y LE 73 THEN agegroup = 5;
ELSE IF H1GI1Y LE 76 THEN agegroup = 4;
ELSE IF H1GI1Y LE 79 THEN agegroup = 3;
ELSE IF H1GI1Y LE 82 THEN agegroup = 2;
ELSE IF H1GI1Y LE 85 THEN agegroup = 1;
ELSE IF H1GI1Y = 96 THEN agegroup = .;
 
/* Grouping the weights into 5 ranges.
   Codes 996 and 998 are set to missing; code 999 is kept as its own group 5. */
IF H1GH60 = 996 OR H1GH60 = 998 THEN weightgroup = .;
ELSE IF H1GH60 <= 100 THEN weightgroup = 1;
ELSE IF H1GH60 <= 200 THEN weightgroup = 2;
ELSE IF H1GH60 <= 300 THEN weightgroup = 3;
ELSE IF H1GH60 <= 400 THEN weightgroup = 4;
ELSE IF H1GH60 = 999 THEN weightgroup = 5;

/* Set codes 6, 7 and 8 of the pill and laxative variables, and code 7 of the exercise variable, to missing */
IF H1GH30D IN (6, 7, 8) THEN H1GH30D = .;
IF H1GH30E IN (6, 7, 8) THEN H1GH30E = .;
IF H1GH31B = 7 THEN H1GH31B = .;

PROC SORT; BY AID; RUN;
PROC FREQ; TABLES agegroup weightgroup; RUN;

/* Univariate graphs */
PROC GCHART; VBAR agegroup / DISCRETE TYPE=PCT WIDTH=30; RUN;
PROC GCHART; VBAR weightgroup / DISCRETE TYPE=PCT WIDTH=30; RUN;
PROC GCHART; VBAR H1GH30D / DISCRETE TYPE=PCT WIDTH=30; RUN;
PROC GCHART; VBAR H1GH30E / DISCRETE TYPE=PCT WIDTH=30; RUN;
PROC GCHART; VBAR H1GH31B / DISCRETE TYPE=PCT WIDTH=30; RUN;

/* Bivariate graphs showing association between two variables */
PROC GPLOT; PLOT H1GH60*agegroup; RUN;

PROC GCHART; VBAR agegroup / DISCRETE TYPE=MEAN SUMVAR=H1GH60; RUN;
PROC GCHART; VBAR agegroup / DISCRETE TYPE=PERCENT SUMVAR=H1GH31B; RUN;
PROC GCHART; VBAR weightgroup / DISCRETE TYPE=PERCENT SUMVAR=H1GH31B; RUN;
QUIT;


The output is shown below:

agegroup   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1                  8      0.12                      8                 0.12
2               2563     39.42                   2571                39.55
3               3480     53.53                   6051                93.08
4                450      6.92                   6501               100.00
Frequency Missing = 3

weightgroup   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1                   517      8.14                    517                 8.14
2                  5502     86.63                   6019                94.77
3                   322      5.07                   6341                99.84
4                     7      0.11                   6348                99.95
5                     3      0.05                   6351               100.00
Frequency Missing = 153

Univariate graph for agegroup:

This graph is unimodal, with its highest peak at group 3, i.e. birth years 77 to 79. It appears to be skewed to the left.
Univariate graph for weightgroup:
This graph is unimodal, with its highest peak at group 2. It appears to be skewed to the right, as the higher frequencies are in the lower groups.
Univariate graph for diet pill consumption:
This graph shows that a large share of the population did not take diet pills to lose weight in the last 7 days.
Univariate graph for laxative consumption in the last 7 days:
This graph shows that a large share of the population did not take laxatives to lose weight in the last 7 days.
Univariate graph for exercise in the last 7 days:
This graph shows that a large number of people worked out during the last 7 days, although the majority did not.
Bivariate graph for agegroup vs. weight:
This graph shows that most of the heavier respondents fall in agegroups 2 and 3. The most common weights lie between about 100 and 160 lbs.
Bivariate graph for agegroup vs. mean weight:
This graph shows the association between agegroup and the mean weight of each group.
Bivariate graph for agegroup vs. exercise in the last 7 days:
This graph is unimodal, with its peak at group 3, and is skewed to the left.
Bivariate graph for weightgroup vs. exercise in the last 7 days:
This graph is unimodal, with its peak at group 2, and is skewed to the right. It suggests that people in weightgroup 2 are more conscious about losing weight through exercise.
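The weight-based bivariate graphs use H1GH60 as it is stored, so any rows coded 996, 998 or 999 enter the scatter plot and the group means as if they were real weights. A minimal sketch of one way to exclude them is given here. It assumes all three codes are special codes rather than weights in pounds, and the data set name new2 and the variable name weight_lbs are hypothetical names introduced only for this illustration.

```sas
/* Sketch only: new2 and weight_lbs are hypothetical names.
   Assumes 996, 998 and 999 are special codes, not real weights. */
DATA new2; SET new;
  weight_lbs = H1GH60;
  IF H1GH60 IN (996, 998, 999) THEN weight_lbs = .;
  LABEL weight_lbs = "Weight in lbs (special codes set to missing)";
RUN;

/* Same bivariate graphs as before, drawn from the cleaned weight variable */
PROC GPLOT DATA=new2; PLOT weight_lbs*agegroup; RUN;
PROC GCHART DATA=new2; VBAR agegroup / DISCRETE TYPE=MEAN SUMVAR=weight_lbs; RUN;
QUIT;
```

Missing values are left out of the mean, so each bar would then reflect only the respondents who reported a weight.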
Summary:
The basic purpose of this study is to analyse the values taken by different variables and the associations between them. We look at the association between agegroup and people's exercise routine, and between weightgroup and exercise routine.
The first 5 graphs above are univariate graphs that show the percentage of observations for each value on the X-axis. As a data management decision, and to keep things simple, I have grouped the observations into age groups and weight groups.
For the categorical variables H1GH30D, H1GH30E and H1GH31B, only the values 0 and 1 are considered, as these values represent whether or not diet pills or laxatives were taken, and likewise for exercise. Value 7, which represents "not applicable", has been dropped, as these entries are not useful for our study.
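A quick way to check this recoding is to tabulate the three variables after the DATA step and confirm which values remain. The sketch below assumes the data set new created by the code above; the MISSING option lists the missing values as their own row so that the recoded entries can be counted.

```sas
/* Sanity check: list the values left in each categorical variable,
   with missing values shown as a separate row */
PROC FREQ DATA=new;
  TABLES H1GH30D H1GH30E H1GH31B / MISSING;
RUN;
```

If any value other than 0, 1 or missing appears in a table, that value has not been recoded yet.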
The graphs from graph 6 onwards are bivariate graphs showing the association between pairs of variables: agegroup and weight, agegroup and exercise in the last 7 days, and weightgroup and exercise in the last 7 days.
The graphs suggest that people in weightgroup 2 are more health conscious and have exercised in the last 7 days. People in weightgroups 3 and 4, in spite of weighing more, are not inclined to exercise.
Similarly, people in agegroups 2 and 3 are more conscious about exercise. So we can infer that people are health conscious when they are under 40 years old, and then they tend to gain weight but do not work out. Similarly, above a particular weight, i.e. 250 lbs and above, people are not interested in losing weight by working out.
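To compare how common exercise is within each group, one option is to look at the proportion of respondents in each group who report exercising. The sketch below is an illustration under assumptions, not part of the original analysis: it assumes H1GH31B is coded 1 for exercised and 0 for not, and it restricts each step to those two values so that any other remaining code is ignored.

```sas
/* Sketch only: assumes H1GH31B = 1 means exercised and 0 means did not */

/* Row percentages: share of each weightgroup that exercised */
PROC FREQ DATA=new;
  WHERE H1GH31B IN (0, 1);
  TABLES weightgroup*H1GH31B / NOCOL NOPERCENT;
RUN;

/* Mean of a 0/1 variable is the proportion of 1s in each group */
PROC GCHART DATA=new;
  WHERE H1GH31B IN (0, 1);
  VBAR weightgroup / DISCRETE TYPE=MEAN SUMVAR=H1GH31B;
RUN;
QUIT;
```

Replacing weightgroup with agegroup gives the same comparison across age groups. Groups with very few respondents, such as weightgroups 4 and 5, would need to be read with caution.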









