PROC APPEND Alternatives
From sasCommunity
Contents |
[edit] Introduction
Suppose you have 10,000 small data sets, with identical structures, that are to be appended to a template or base data set. Can it be done without running PROC APPEND 10,000 times?
Here is an example, starting with the base:
data base; length source a $ 3 b $ 5 d $ 4; retain source a b d ; stop; run;
Here are three data sets representing the 10,000:
data one
two
e04
;
length a b c $ 4;
input (a b c)($);
source='one'; output one;
source='two'; output two;
source='e04'; output e04;
cards;
a1-- b1-- c1--
a2-- b2-- c2--
a2-- b3-- c3--
a2-- b2-- c2--
;
The three are alike in structure, but they differ in structure from the base. One variable is longer, one is shorter, one is absent in the base, and one is present in the base but not in the three sources.
[edit] PROC APPEND
This is the obvious:
proc append base=base data=one force; run; proc append base=base data=two force; run; proc append base=base data=e04 force; run;
The FORCE option is needed because of the structure discrepancies.
The downside is the overhead of 10,000 steps and all the messages in the log. Each step generates a series of messages like:
NOTE: Appending WORK.ONE to WORK.BASE.
WARNING: Variable c was not found on BASE file. The variable will not be added to the BASE file.
WARNING: Variable a has different lengths on BASE and DATA files (BASE 3 DATA 4).
WARNING: Variable b has different lengths on BASE and DATA files (BASE 5 DATA 4).
WARNING: Variable d was not found on DATA file.
NOTE: FORCE is specified, so dropping/truncating will occur.
NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: 4 observations added.
NOTE: The data set WORK.BASE has 4 observations and 4 variables.
NOTE: PROCEDURE APPEND used (Total process time):
real time 0.15 seconds
cpu time 0.06 seconds
[edit] PROC DATASETS
PROC DATASETS can be used instead:
proc datasets nolist; append base=work.base data=work.one force; run; append base=work.base data=work.two force; run; append base=work.base data=work.e04 force; run; quit;
All is done in one step. However, the same cycle of messages for each append appears:
NOTE: Appending WORK.ONE to WORK.BASE. WARNING: Variable c was not found on BASE file. The variable will not be added to the BASE file. WARNING: Variable a has different lengths on BASE and DATA files (BASE 3 DATA 4). WARNING: Variable b has different lengths on BASE and DATA files (BASE 5 DATA 4). WARNING: Variable d was not found on DATA file. NOTE: FORCE is specified, so dropping/truncating will occur. NOTE: There were 4 observations read from the data set WORK.ONE. NOTE: 4 observations added. NOTE: The data set WORK.BASE has 16 observations and 4 variables.
It seems that behind the scenes PROC APPEND and PROC DATASETS use the same code, so little performance gain should be expected.
[edit] PROC SQL
The INSERT statement can be used:
proc sql; insert into base ( source, a, b, d ) select source, a, b, ' ' from one; insert into base ( source, a, b, d ) select source, a, b, ' ' from two; insert into base ( source, a, b, d ) select source, a, b, ' ' from e04; quit;
The alignment of source and destination is based on position, not column name. Each statement generates just one message, like this one:
NOTE: 4 rows were inserted into WORK.BASE.
The processor opens and closes BASE for each statement, so it's probably not too efficient.
[edit] DATA step
The MODIFY statement can be used to open a data set for writing additional observations, as in:
data base; if 0 then modify base; set one two e04 open=defer; output; run;
There is one NOTE for each data set opened by SET and one overall NOTE generated by MODIFY:
NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: There were 4 observations read from the data set WORK.TWO.
NOTE: There were 4 observations read from the data set WORK.E04.
NOTE: The data set WORK.BASE has been updated. There were 0 observations rewritten, 12
observations added and 0 observations deleted.
This should be rather efficient, since BASE is opened only once.
Concern: Will this break or degrade in performance when the number of data sets read gets really large?
Instead of using the SET statement, the source data sets can be specified at run time and harvested using hash objects. For example:
data base; if 0 then modify base; if 0 then call missing(source, a, b); * CALL MISSING here is workaround for bug described in http://support.sas.com/kb/19/207.html . Bug appears to affect OUTPUT as well as REPLACE. Bug appears to be fixed in 9.2. ; do dsn = 'one', 'two', 'e04'; declare hash hh(dataset:dsn); declare hiter ii ('hh') ; hh.definekey ('source', 'a', 'b') ; hh.definedata ('source', 'a', 'b') ; hh.definedone () ; rc = ii.first(); do while (rc=0); output; rc = ii.next(); end; hh.delete(); end; stop; run;
The log shows:
NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: There were 4 observations read from the data set WORK.TWO.
NOTE: There were 4 observations read from the data set WORK.E04.
NOTE: The data set WORK.BASE has been updated. There were 0 observations rewritten, 9
observations added and 0 observations deleted.
Note that only 9 observations were added. That's because the hash object purges duplicate observations. A related side effect: order of observations is not preserved. Question: Can SAS 9.2 hash enhancements mitigate these side effects?
The hash approach should be more robust than the SET statement for really large numbers of data sets, and ought to be at least as fast.
