SAS程序猿/媛在处理数据的时候,经常会遇到要处理有关重复记录的问题,其中有些重复记录是我们需要的,而有的则是多余的。如果是多余的直接去重:
- PROC SORT,其中有两个选项NODUPKEY、NODUPRECS(NODUP),第一个是按照BY变量来去重,第二是比较整条记录来去重,重复的记录可以用DUPOUT=来保留。程序如下:
proc sort data=sashelp.class out=unq nodupkey dupout=dup;
by WEIGHT;
run;
- HASH,程序如下:
data _null_;
if 0 then set sashelp.class;
if _n_=1 then do;
declare hash h(dataset: 'sashelp.class', ordered: 'y');
h.definekey('WEIGHT');
h.definedata(all:'y');
h.definedone();
end;
h.output(dataset: 'uni');
stop;
run;
如果重复记录是需要保留以备后用则可以用下面几种方法:
- DATA步,程序如下:
proc sort data=sashelp.class out=class;
by WEIGHT;
run;
data uni dup;
set class;
by WEIGHT;
if first.WEIGHT and last.WEIGHT then output uni;
else output dup;
run;
- PROC SQL,程序如下:
proc sql;
create table uni as
select *
from sashelp.class
group by WEIGHT
having count(*) = 1
;
create table dup as
select *
from sashelp.class
group by WEIGHT
having count(*) > 1
;
quit;
- HASH,程序(SAS9.2 )如下:
data uni(drop=rc: i);
if _n_=1 then do;
if 0 then set sashelp.class;
dcl hash h1(dataset: 'sashelp.class', multidata:'y');
h1.definekey('WEIGHT');
h1.definedata(all: 'yes');
h1.definedone();
dcl hash h2(dataset: 'sashelp.class');
dcl hiter hi('h2');
h2.definekey('WEIGHT');
h2.definedone();
end;
rc1=hi.first();
do while(rc1=0);
rc2= h1.find();
i=0;
do while(rc2=0 and i < 2);
i 1;
rc2=h1.find_next();
end;
if i < 2 then do;
output;
if i < 2 then h1.remove();
end;
rc1=hi.next();
end;
h1.output(dataset: 'dup');
run;
不管是去重还是保留重复的记录,上面几种方法中HASH行数都是最多的,但是这种方法在去重之前不用排序,故当处理的数据集较大时建议使用此方法以提高效率。