SAS

From Colettapedia

Basics

SAS 9.4 DOCUMENTATION

PROC FORMAT

proc format;
 value cfmt 0 = "Non-Charter" 1 = "Charter";
 value ufmt 0 = "Rural" 1 = "Urban";
 run;

Example with string data

PROC FORMAT;
    VALUE $soil_frmt 'STP' = 'Reconstructed prairie' 'REM' = 'Remnant prairie'
					'CUL' = 'Cultivated land';
    VALUE $sterile_frmt 'Y' = 'yes' 'N' = 'no';
    VALUE $species_frmt 'L' = 'Leadplant' 'C' = 'Coneflower';
RUN;

PROC IMPORT

 PROC IMPORT OUT= WORK.Charter_Wide
 DATAFILE= "G:\data\Classes\ST775\BYSH\BYSH Data sets and R scripts\Ch9 Two-Level Longitudinal\chart_wide_condense-SAS.csv"
 DBMS=CSV REPLACE;
 GETNAMES=YES;
 RUN;
  • load from file
    • limit number of lines on import
    • missover option
    • import from excel
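
The import bullets above could be sketched roughly as follows. The file paths, data set names, and variable names here are placeholders, not files from these notes, and DBMS=XLSX assumes SAS/ACCESS Interface to PC Files is available.

```sas
/* Read a CSV in a DATA step: skip the header row, stop after 100 data rows. */
/* MISSOVER sets trailing variables to missing on short lines instead of     */
/* continuing the read on the next line.                                     */
DATA work.first_rows;
    INFILE "C:\hypothetical\path\data.csv" DSD MISSOVER FIRSTOBS=2 OBS=101;
    INPUT id name :$20. score;   /* hypothetical column layout */
RUN;

/* Import one worksheet from an Excel workbook (sheet name is assumed). */
PROC IMPORT OUT=work.from_excel
    DATAFILE="C:\hypothetical\path\data.xlsx"
    DBMS=XLSX REPLACE;
    SHEET="Sheet1";
    GETNAMES=YES;
RUN;
```

OBS= on INFILE names the last record to read, so FIRSTOBS=2 with OBS=101 reads 100 data rows.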

Functions

  • LOG - natural logarithm
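  • LAG - returns the value its argument had on the previous call, usually the previous observation in a DATA step
  • PROBNORM - standard normal CDF (area under the curve to the left of z)
    • P( X < x ) = probnorm((x - mu) / sigma); e.g. with mu = 12.3 and sigma = 2.3, P( X < 10 ) = probnorm((10 - 12.3) / 2.3) = probnorm(-1)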
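
A minimal sketch of the three functions listed above, using made-up values. The data set and variable names are illustrative only.

```sas
/* Small DATA step with made-up values to show LOG and LAG. */
data func_demo;
    input y @@;
    log_y  = log(y);        /* natural logarithm */
    prev_y = lag(y);        /* value of y from the previous row; missing on row 1 */
    diff   = y - prev_y;    /* change since the previous row */
    datalines;
2 4 8 16
;
run;

/* P(X < 10) for X normal with mean 12.3 and standard deviation 2.3: */
/* standardize first, then call PROBNORM.                            */
data _null_;
    z = (10 - 12.3) / 2.3;  /* about -1 */
    p = probnorm(z);        /* roughly 0.1587 */
    put z= p=;              /* writes both values to the log */
run;
```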

Other stuff

  • distribution functions
  • TITLE
  • PROC EXPORT
  • FILENAME infile "W:\SAS_projects\ST775\seeds2.csv";
  • LIBNAME musicraw "W:\SAS_projects\ST775\music";

DATA steps

  • DATA name;
  • INFILE 'K:\LG\iicbu\IICBU\colettace\SAS_examples\datasets\running.dat';
  • INPUT class 1 sex $ 3 race1_min 5 race1_sec 7-8 race2_min 10 race2_sec 12-13;
  • LABEL gestage="Gestational Age (days)" bweight="Birth Weight (grams)";
  • WHERE
  • SET other_data1 other_data2 other_data3;
  • BY varname;
  • DO
  • DROP varname1 varname2;
  • KEEP varname3 varname4;
  • put
  • FORMAT urban ufmt. charter cfmt.;
  • MERGE dataset1 dataset2; BY variablename;
    • MERGE Charter_Long nonc_2 (in = notc);
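
As an illustration of MERGE with BY and the IN= data set option, here is a small self-contained sketch. The data sets a and b and the flags in_a and in_b are invented for the example.

```sas
data a;
    input id x;
    datalines;
1 10
2 20
3 30
;
run;

data b;
    input id y;
    datalines;
2 200
3 300
4 400
;
run;

/* Both inputs must already be sorted by the BY variable (they are here). */
data both;
    merge a (in = in_a) b (in = in_b);
    by id;
    if in_a and in_b;   /* keep only ids present in both inputs: 2 and 3 */
run;
```

Dropping the subsetting IF keeps every id from either input, with missing values where one side has no match.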

IF

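  • IF varname ^= .;
    • subsetting IF (no THEN): keeps only rows where varname is not missing
    • IF ... THEN statement; runs the statement only when the condition is true
    • ELSE statement; runs when the preceding IF condition is false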
	IF instrument = "orche" THEN
		orch = 1;
	ELSE orch = 0;
	IF perform_type = "Large Ensemble" THEN
		large = 1;
	ELSE large = 0;
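
A small sketch contrasting the subsetting IF with IF-THEN/ELSE. The scores data set and the pass cutoff of 60 are made up for illustration.

```sas
data scores;
    input name $ score;
    datalines;
Ann 91
Bob .
Cy 58
;
run;

data scores_flagged;
    set scores;
    if score ^= .;                  /* subsetting IF: the row for Bob is dropped */
    if score >= 60 then pass = 1;   /* IF-THEN/ELSE: build a 0/1 indicator */
    else pass = 0;
run;
```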

PROCS

PROC PRINT


PROC SORT

Syntax: PROC SORT <collating-sequence-option> <other option(s)>;  
      BY <DESCENDING> variable-1 <...<DESCENDING> variable-n>; 
      
The SORT procedure orders SAS data set observations by the values of one or more 
character or numeric variables. The SORT procedure either replaces the original 
data set or creates a new data set. PROC SORT produces only an output data set.
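
For instance, using the sashelp.class sample data set that ships with SAS (the output data set name is arbitrary):

```sas
/* OUT= writes the sorted rows to a new data set; without it the input is replaced. */
proc sort data=sashelp.class out=work.class_sorted;
    by sex descending height;   /* sex ascending, then tallest first within sex */
run;
```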

PROC UNIVARIATE

Example: Test for normality

title3 "Test whether the systolic BP for entire group is normally distributed";

proc univariate data=problem5_3 normal;
	var sys_bp;
	histogram sys_bp / normal midpoints=90 to 140 by 2.5;
	probplot / square;
run;

Other Example

proc univariate data=merged noprint;
	by gender notsorted;
	var height weight;
	histogram height weight;
run;

PROC BOXPLOT

PROC BOXPLOT data=WORK.chart_long;
	PLOT MathAvgScore * charter / GRID HORIZONTAL BOXSTYLE=SCHEMATIC;
RUN;

PROC MEANS

Descriptive Statistics

proc means data=problem10_3 mean std min max;
   by sex;
   var race1_time;
run;

Hypothesis testing


* Mean time for girls in race 1 > 78s?;

DATA problem10_3;
   SET problem10_3;                 * read the existing rows, otherwise race1_time is missing;
   test_race1 = race1_time - 78;
RUN;
proc means data=problem10_3 t probt;
   by sex;
   var test_race1;
run;
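
Note that PROBT in PROC MEANS is a two-sided p-value, while the question above is one-sided. One possible alternative, assuming the same problem10_3 data set with sex and race1_time, is PROC TTEST with the H0= and SIDES= options:

```sas
/* BY-group processing expects the data to be sorted by sex. */
proc sort data=problem10_3;
   by sex;
run;

/* H0=78 sets the null mean; SIDES=U tests the alternative "mean > 78". */
proc ttest data=problem10_3 h0=78 sides=u;
   by sex;
   var race1_time;
run;
```

This tests race1_time directly, so the shifted variable test_race1 is not needed.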

Documentation

Syntax: PROC MEANS <option(s)> <statistic-keyword(s)>;
    BY <DESCENDING> variable-1 <... <DESCENDING> variable-n><NOTSORTED>; 
    CLASS variable(s) </ option(s)>; 
    FREQ variable; ID variable(s); 
    OUTPUT <OUT=SAS-data-set> <output-statistic-specification(s)> <id-group-specification(s)> <maximum-id-specification(s)> <minimum-id-specification(s)> < / option(s)> ; 
    TYPES request(s); 
    VAR variable(s) < / WEIGHT=weight-variable>;
    WAYS list;
    WEIGHT variable; 


The MEANS procedure provides data summarization tools to compute descriptive statistics for variables across all observations and within groups of observations. For example, PROC MEANS

o calculates descriptive statistics based on moments 
o estimates quantiles, which includes the median 
o calculates confidence limits for the mean 
o identifies extreme values 
o performs a t test.

By default, PROC MEANS displays output. You can also use the OUTPUT statement to store the statistics in a SAS data set. PROC MEANS and PROC SUMMARY are very similar.

Example

** Obtain the average of the three math scores by school. ;
PROC MEANS DATA = Charter_Long noprint nway;
 class SchoolNum;
 var AvgMathScore;
 output out=math_mean mean = Mean_Math_Score;
RUN;

*     Get Charter and Urban to make plots;
DATA CU;
 set charter_wide;
 keep SchoolNum urban charter;
 run;

 data math_mean;
 merge math_mean CU;
 by SchoolNum;
RUN;

PROC FREQ

Pivot Table Example

PROC FREQ DATA=music;
    TABLES orch * large;
RUN;

Chi-Squared test of Proportion

proc freq data=problem10_8;
    table sex / chisq testp = (0.5, 0.5);
run;

Documentation

Keyword:  FREQ
Context: [PROCEDURE DEFINITION] PROC FREQ

Syntax: PROC FREQ <options> ; 
    BY variables ; 
    EXACT statistic-options </ computation-options> ; 
    OUTPUT <OUT=SAS-data-set> options ; 
    TABLES requests </ options> ; 
    TEST options ; 
    WEIGHT variable </ option> ; 

The FREQ procedure produces one-way to n-way frequency and contingency (crosstabulation) tables. 
For two-way tables, PROC FREQ computes tests and measures of association. For n-way tables, PROC 
FREQ provides stratified analysis by computing statistics across, as well as within, strata. 

For one-way frequency tables, PROC FREQ computes goodness-of-fit tests for equal proportions or 
specified null proportions. For one-way tables, PROC FREQ also provides confidence limits and 
tests for binomial proportions, including tests for noninferiority and equivalence. 

For contingency tables, PROC FREQ can compute various statistics to examine the relationships 
between two classification variables. For some pairs of variables, you might want to examine the 
existence or strength of any association between the variables. To determine if an association 
exists, chi-square tests are computed. To estimate the strength of an association, PROC FREQ 
computes measures of association that tend to be close to zero when there is no association and 
close to the maximum (or minimum) value when there is perfect association. The statistics for 
contingency tables include the following: 

  o chi-square tests and measures 
  o measures of association 
  o risks (binomial proportions) and risk differences for 2 x 2 tables 
  o odds ratios and relative risks for 2 x 2 tables 
  o tests for trend 
  o tests and measures of agreement 
  o Cochran-Mantel-Haenszel statistics

BY variables


Syntax: BY variables; 

You can specify a BY statement with PROC FREQ to obtain separate analyses of observations in groups that 
are defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be 
sorted in order of the BY variables. If you specify more than one BY statement, only the last one specified is used. 

If your input data set is not sorted in ascending order, use one of the following alternatives: 

  • Sort the data by using the SORT procedure with a similar BY statement. 
  • Specify the NOTSORTED or DESCENDING option in the BY statement for the FREQ procedure. The NOTSORTED 
    option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values 
    of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order. 
  • Create an index on the BY variables by using the DATASETS procedure (in Base SAS software).
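
The first alternative (sort, then use the same BY statement) might look like this, using the sashelp.class sample data set. The output data set name is arbitrary.

```sas
/* Sort a copy of the data by the BY variable. */
proc sort data=sashelp.class out=work.class_by_sex;
    by sex;
run;

/* One frequency table of age per value of sex. */
proc freq data=work.class_by_sex;
    by sex;
    tables age;
run;
```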

TABLE options

Keyword:  CHISQ
Context: [PROC FREQ, TABLE STATEMENT] CHISQ option

Syntax: CHISQ <(chisq-options)> 
          
Requests chi-square tests of homogeneity or independence and measures of association that are based 
on the chi-square statistic. For two-way tables, the chi-square tests include the Pearson chi-square, 
likelihood ratio chi-square, and Mantel-Haenszel chi-square tests. The chi-square measures include the 
phi coefficient, contingency coefficient, and Cramér's V. For 2 x 2 tables, the CHISQ option also provides 
Fisher's exact test and the continuity-adjusted chi-square test.  

For one-way tables, the CHISQ option provides the Pearson chi-square goodness-of-fit test. You can also 
request the likelihood ratio goodness-of-fit test for one-way tables by specifying the LRCHI chisq-option 
in parentheses after the CHISQ option. By default, the one-way chi-square tests are based on the null 
hypothesis of equal proportions. Alternatively, you can provide null hypothesis proportions or frequencies 
by specifying the TESTP= or TESTF= chisq-option, respectively. 

You can specify the following chisq-options in parentheses after the CHISQ option:

  DF=df
    specifies the degrees of freedom for the chi-square tests. The value of df must not be zero.
  LRCHI 
    requests the likelihood ratio goodness-of-fit test for one-way tables.
  TESTF=(values) | SAS-data-set
    specifies null hypothesis frequencies for the one-way chi-square goodness-of-fit tests.
  TESTP=(values) | SAS-data-set
    specifies null hypothesis proportions for the one-way chi-square goodness-of-fit tests.
  WARN=value | (values) 
    controls the warning message for the validity of the asymptotic Pearson chi-square test. 
    By default, PROC FREQ displays a warning message when more than 20% of the table cells 
    have expected frequencies that are less than 5.
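
A hedged sketch of the parenthesized chisq-options on a one-way table, using the sashelp.class sample data. It assumes a SAS release recent enough to accept options in parentheses after CHISQ; older releases may only accept TESTP= as a separate TABLES option, as in the earlier example.

```sas
/* One-way goodness-of-fit test of sex against null proportions 0.5 and 0.5. */
/* LRCHI adds the likelihood ratio version of the test.                      */
proc freq data=sashelp.class;
    tables sex / chisq(lrchi testp=(0.5 0.5));
run;
```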

PROC CORR

proc corr data=problem15_1 nosimple;
    var house_size;
    with family_income;
run;

PROC REG

  • create a linear model
proc reg data=problem15_1 plots;
    model peak_hour_load = aircon_capac;
    plot peak_hour_load * aircon_capac;   * Plot of data with fitted line;
run;

PROC FMM

PROC FMM data=a (where=(Player=1));
    MODEL Score = /k=3 parms( -5 4 13 );   * -5 4 13 are placeholder starting guesses;
RUN;

PROC SGPLOT

Bar plot

proc SGPLOT data=problem6_8;
	vbar momeduc;
run;

Histogram

title 'Mileage Distribution';
proc sgplot data=sashelp.cars;
  histogram mpg_city;
  density mpg_city  / type=normal legendlabel='Normal' lineattrs=(pattern=solid);
  density mpg_city  / type=kernel legendlabel='Kernel' lineattrs=(pattern=solid);
  keylegend / location=inside position=topright across=1;
  xaxis display=(nolabel);
  run;

Regression plot

proc sgplot data=elephant.data;
    REG x=AGE y=MATINGS;
RUN;

Boxplot

 proc sgplot data = Charter_Long;
 hbox AvgMathScore / group = charter;
 title 9.2a Average Math Score by Type of School;
 title2 All Data;
 run;

Scatterplot with regression line

 proc sgplot data = Charter_Long;
 scatter x = schPctsped y = AvgMathScore;
 reg x = schPctsped y = AvgMathScore;
 run;

PROC SGPANEL

Example Spaghetti plot by species with loess fit

PROC SGPANEL DATA=seeds_long;
  WHERE plant <= 71;
  PANELBY plant / COLUMNS=5 ROWS=5 spacing=8;
  SERIES X=time13 y=hgt / GROUP=plant LINEATTRS = (COLOR = gray);
  SCATTER X=time13 y=hgt;
  LOESS X=time13 Y=hgt / lineattrs = (color = black thickness =2);
 
RUN;

PROC TTEST

  • two sample ttest
  • pooled ttest
  • paired ttest
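
A minimal sketch of the paired and two-sample cases. The bp data set and its values are invented for illustration; sashelp.class is the sample data set that ships with SAS.

```sas
/* Paired t test: one row per subject, two measurements per row. */
data bp;
    input id before after;
    datalines;
1 120 115
2 130 128
3 125 119
4 140 135
;
run;

proc ttest data=bp;
    paired before*after;   /* tests mean(before - after) = 0 */
run;

/* Two-sample t test: the output reports both the pooled (equal variance) */
/* and the Satterthwaite (unequal variance) results.                      */
proc ttest data=sashelp.class;
    class sex;
    var height;
run;
```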

Example

proc ttest data=question6;
	class group_cat;
	var psa psa_ln;
run;

Documentation

Syntax: PROC TTEST <options> ; 
    CLASS variable ; 
    PAIRED variables ; 
    BY variables ; 
    VAR variables </ options> ; 
    FREQ variable ; 
    WEIGHT variable ; 

The TTEST procedure performs t tests and computes confidence limits for one sample, paired 
observations, two independent samples, and the AB/BA crossover design.

PROC NPAR1WAY

Wilcoxon

proc npar1way data=question6 wilcoxon;
	class group_cat;
	var psa psa_ln;
run;

PROC MIXED

  • PROC MIXED documentation
  • "A mixed linear model is a generalization of the standard linear model used in the GLM procedure, the generalization being that the data are permitted to exhibit correlation and nonconstant variability."
  • "The mixed linear model, therefore, provides you with the flexibility of modeling not only the means of your data (as in the standard linear model) but their variances and covariances as well."
  • "The primary assumptions underlying the analyses performed by PROC MIXED are as follows:"
    • "The data are normally distributed (Gaussian)."
    • "The means (expected values) of the data are linear in terms of a certain set of parameters."
    • "The variances and covariances of the data are in terms of a different set of parameters, and they exhibit a structure matching one of those available in PROC MIXED."
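
In the matrix notation commonly used for mixed models, these assumptions can be summarized as below. Here X and Z are known design matrices, β holds the fixed effects, γ the random effects, and G and R are the covariance matrices chosen through the RANDOM and REPEATED statements. The symbols are standard notation added here, not part of the quoted text.

```latex
\begin{aligned}
y &= X\beta + Z\gamma + \varepsilon, \\
\gamma &\sim N(0, G), \qquad \varepsilon \sim N(0, R), \\
\operatorname{Var}(y) &= Z G Z^{\top} + R .
\end{aligned}
```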

Example

 proc mixed data = charter_long noclprint;
 class SchoolNum;
 model AvgMathScore = Year0809 year0910 / s;
 random Int / sub = SchoolNum g gcorr;
 title 9.5.3 Piecewise linear model;
 run;

Example2

PROC MIXED DATA=chart_long NOCLPRINT CL COVTEST METHOD=REML;
	CLASS schoolid;
	MODEL MathAvgScore = charter urban schPctfree schPctsped 
        year08 charter*year08 urban*year08 schPctsped*year08
		/ SOLUTION RESIDUAL CL; 
	RANDOM INT year08/ SUBJECT=schoolid TYPE=UN G SOLUTION GCORR;
	ODS EXCLUDE WHERE=( _PATH_ ? 'ResidualPlots' );
	ODS EXCLUDE "The Mixed Procedure"."Solution for Random Effects";
RUN;

Example3

PROC MIXED DATA=music_final_model NOCLPRINT CL COVTEST METHOD=REML;
	CLASS id;
	MODEL na = previous students juried public solo mpqpem mpqab orch mpqnem mpqnem*solo / SOLUTION RESIDUAL CL; 
	RANDOM INT previous public / SUBJECT=id TYPE=UN G SOLUTION GCORR;
RUN;

PROC GENMOD

PROC GENMOD DATA=elephant.data;
    * CLASS ;
    MODEL matings=age age2/ DIST=poisson LINK=log;
RUN;

Quasilikelihood

PROC GENMOD DATA=elephant_quad_model;
    MODEL MATINGS=AGE / DIST=poisson LINK=log DSCALE;
RUN;

PROC SQL

PROC SQL ;
	CREATE TABLE elephant_summary AS
	SELECT AGE, MEAN(MATINGS) AS MEAN_MATINGS
	FROM elephant.data
	GROUP BY AGE;
QUIT;

PROC SURVEYSELECT

 proc surveyselect data = nonc
 out = nonc_2
 method = srs
 seed = 275214 SAMPSIZE=80;
 run;

Macros

Simple Example

%macro mytest( indep_var );
proc freq data=skyline;
	table gender * &indep_var / chisq;
	*table var2 / chisq cellchi2;
run;
%mend mytest;

%mytest( compare );
%mytest( argumentation );

Crazy Example

%macro sphyg_mixed(Y, YNAME, X, X2, FILE);

	ODS TRACE ON;
	ODS PDF FILE="K:\LG\iicbu\IICBU\colettace\SAS_projects\SardiNIA\output\&FILE.all_output.pdf"
            STYLE=HTMLEncore;
	TITLE1 Mixed Effects Model Results for &YNAME. for men and women combined.;
	PROC MIXED DATA=sphygdat NOCLPRINT NOITPRINT MAXFUNC=400 COVTEST;
		CLASS id_individual Sex machine_ver;
		MODEL &Y = &X &X2 / SOLUTION RESIDUAL CL OUTPRED=&Y._pred OUTPREDM=&Y._predm; * DDFM=KENWARDROGER ;
		RANDOM INT Time / SUBJECT=id_individual TYPE=UN G SOLUTION GCORR;
		ODS OUTPUT solutionf=&Y._sf(rename=(estimate=&Y._fe));
		ODS OUTPUT solutionr=&Y._sr(rename=(estimate=&Y._re));
		ODS EXCLUDE "The Mixed Procedure"."Solution for Random Effects";
	RUN;
	PROC EXPORT DATA=&Y._sf
   		OUTFILE= "K:\LG\iicbu\IICBU\colettace\SAS_projects\SardiNIA\output\&FILE._&Y._solutionf.csv" 
   		DBMS=csv REPLACE;
	RUN;
	PROC EXPORT DATA=&Y._sr
   		OUTFILE= "K:\LG\iicbu\IICBU\colettace\SAS_projects\SardiNIA\output\&FILE._&Y._solutionr.csv" 
   		DBMS=csv REPLACE;
	RUN;

	DATA sphyglib.&Y._predm;
		SET &Y._predm;
	RUN;
	* save/reload results to/from disk, needed this for some reason;
	DATA sphyglib.&Y._pred;
		SET &Y._pred;
		KEEP id_individual Wave reading agegroup &X &Y pred resid StdErrPred;
	RUN;
	DATA &Y._pred;
		SET sphyglib.&Y._pred;
		FORMAT Sex sex_frmt.;
		FORMAT agegroup agegroup_frmt.;
	RUN;
	*TITLE1 'Contents of &YNAME. dataset AFTER running model';
	*PROC CONTENTS DATA=&Y._pred;
	*RUN;

	TITLE1 &YNAME. Model Checks;
	TITLE2 Correlation between obs and pred values from LME model;
	PROC CORR DATA = &Y._pred;
		VAR &Y pred;
	RUN;

	goptions htext = 2 hby = 2;* colors = (black);
	symbol1 cv=black v=dot height = 0.5 i=none; 
	axis1 label = (a=90 'Observed');
	axis2 label = ('Predicted');

	PROC GPLOT DATA=&Y._pred;
		PLOT &Y * pred / vaxis = axis1 haxis = axis2;
		TITLE Observed vs. Predicted for &YNAME;
	RUN;
	QUIT;

	goptions htext = 2 hby = 2;* colors = (black);
	symbol1 cv=black v=dot height = 0.5 i=none; 
	axis1 label = (a=90 'Residual');
	axis2 label = ('Predicted');

	PROC GPLOT DATA=&Y._pred;
		PLOT resid*pred / vref = 0 vaxis = axis1 haxis = axis2;
		TITLE Residuals vs. Predicted for &YNAME;
	RUN;
	QUIT;

	PROC REG DATA=&Y._pred PLOTS=NONE;
		MODEL pred = &Y / RSQUARE RMSE;
	RUN;
	QUIT;
	ODS PDF CLOSE;
	ODS TRACE OFF;
%mend sphyg_mixed;

%let covars = machine_ver Sex fage fage2 Time;
%let covars2 = fAge*Time;
%let other_covars = exmWeight exmHeight exmBMI exmWaist pwv labsGlicemia labsHdl labsTrigliceridi labsColesterolo;
%let exp_var_name_list= pwv_ln SP C_SP P_SP DP C_DP P_DP HR P_MEANP C_MEANP;
%let exp_var_desc_list = ln(PWV), Systolic Pressure, Central Systolic Pressure,
	Peripheral Systolic Pressure, Diastolic Pressure, Central Diastolic Pressure,
	Peripheral Diastolic Pressure, Heart Rate, Peripheral Mean Pressure,
	Central Mean Pressure;

/* macro function signature
%macro sphyg_mixed(Y, YNAME, X, X2, FILE);
*/

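%* %local and %do are only valid inside a macro, so the loop is wrapped in one;
%macro run_all_models;
	%local i this_var this_description;
	%do i=1 %to %sysfunc( countw( &exp_var_name_list ) );
		%let this_var = %scan( &exp_var_name_list, &i );
		%* descriptions contain blanks and parentheses, so split on commas only and keep the result quoted;
		%let this_description = %qscan( %superq(exp_var_desc_list), &i, %str(,) );

		%sphyg_mixed( &this_var, &this_description, &covars &other_covars, &covars2 , &this_var._model2 );
	%end;
%mend run_all_models;
%run_all_models;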

Enterprise Miner