#credit_risk #modeling #SASprogram #coding
Sharing some SAS code for data processing
Data Input
Input data type: Excel. In this example, data starts from row 10.
proc import out = tb0_&file_date.
datafile = "/folder/input/report_&file_date..xlsx"
dbms = xlsx replace;
sheet = 'sheet1';
datarow=10; /*first data row*/
getnames=NO;
run;
/*Reference: find column names */
proc import out = col_header
datafile = "/folder/input/report_&file_date..xlsx"
dbms = xlsx replace;
range="'sheet1'$A6:BC6"; /*column row*/
getnames=NO;
run;
proc print data=col_header;run;
Convert data type
Character to numeric: INPUT(var, BEST.)
Numeric to character: PUT(var,$11.)
Combine or compile files
Use : for referencing file name.
/*Combine month1, month2, month3*/
Data month_1_to_3;
set month:;
run;
Check stats and distribution for numeric and categorical data
%let file_date = 201709;
/*numeric*/
proc sql noprint;
select name into : vars separated by " "
from dictionary.columns
where LIBNAME = upcase("work")
and MEMNAME = upcase("tb3_&file_date.")
and type='num'
;
quit;
/*get numeric variable names*/
%put variables = &vars.;
/*get min, max, mean*/
proc means data = tb3_&file_date. n nmiss mean median mode min max P1 P99;
var &&vars.;
run;
*********************************************************;
*********************************************************;
/*categorical*/
proc sql noprint;
select name into : vars_char separated by " "
from dictionary.columns
where LIBNAME = upcase("work")
and MEMNAME = upcase("tb3_&file_date.")
and type='char'
;
quit;
/*get categorical variable names*/
%put variables = &vars_char.;
/*get frequency*/
proc freq data=tb3_&file_date.;
table &&vars_char.;
run;
Remove format of a column
Sometimes there is unknown format masking a column. To retreive the original data, copy the column in a new column and it will automatically remove the format.
Strategy to combine data
- Clean up the mistake within one month file
- Month by month consistent
- Whole-year set
- Convert the whole-year set into desired data type
- Combines multiple years of data
Clean up manual mistakes and special character in numbers
Same columns came with different data type. Column with all missing value becomes character.
Some have special characters from manual mistake, such as “123..4%”
data tb3_&file_date.;
set tb2_&file_date.;
/*all missing so char --> numeric*/
Company_ID_NB2 = input(Company_ID_NB, BEST.);
/*Replace incorrect format such as 1.24.1 --> 1.24:1 */
If Debt_TBC = "1.24.1" then Debt_TBC2 = "1.24";
else if Debt_TBC = "1.43.1" then Debt_TBC2 = "1.43";
else Debt_TBC2 = Debt_TBC;
Rename and drop columns
data input.RAT_&file_date.;
set tb4_&file_date. ;
rename
Debt_TBC3 = Debt_TBC
Current_Ratio3 = Current_Ratio;
drop
Debt_TBC
Debt_TBC2
Current_Ratio
Current_Ratio2
;
run;
Missing value for numeric and character column
data test;
set tmp.code_change;
ID_Approved_by=''; /*character*/
Date_Approved=.; /*numeric*/
run;
Column Format: DATE
Define a data “201709”
Create a column called “scoringMDY” that captures this date (01SEP2017)
%let file_date = 201709;
data tb1_&file_date.;
set tb0_&file_date.;
format scoringMDY date9.; /*ex.01FEB2017*/
scoringMDY = mdy(%substr(&file_date.,5,2),1,%substr(&file_date.,1,4));
Group Name using RETAIN
data tb2_&file_date.;
set tb1_&file_date.;
/*TO-DO: group name*/
if index(compress(upcase(company_name)),"GROUP") > 0 then group_name = company_name;
retain group_name;
run;