Monday, November 1, 2010

Mean data


Unfriendly, really.

(Yes, the picture is tiny and the format for this blog is poor for code. Working on that...)

The data above shows us another data irregularity. Each order, which can have multiple items, has the same sales ID number. But the customer info only appears in the first line. In order to perform detailed analysis such as "Which state purchases the most 49ers items?" I need every item to be associated with all buyer info.

Here is what I came up with. I use an if statement to select only those values with the same order number, as not all missing data represents a combined order. This also saves processing time vs checking all orders.

data ebaysale.june09e;
set ebaysale.june09e;
by Sales_Record_Number;
if last.Sales_record_number=0 or first.sales_record_number=0 then do;
array Ovar{3} user_id buyer_state Buyer_Country;
array Ovarn{2} shipping_and_handling total_price;
array Rvar{3} $ 55;
array Rvarn{2};
retain Rvar1-Rvar3 Rvarn1-Rvarn2 paid_on_dateR;
do i = 1 to 3;
if missing(Ovar{i}) then Ovar{i}=Rvar{i};
else Rvar{i}=Ovar{i};
end;
do i = 1 to 2;
if missing(Ovarn{i}) then Ovarn{i}=Rvarn{i};
else Rvarn{i}=Ovarn{i};
end;
if paid_on_date=. then paid_on_date=paid_on_dateR;
else paid_on_dateR=paid_on_date;
end;
run;

Head scratcher: I could NOT get date to work in the character or numeric array. If I put it in the numerical array, I got an error, and if I used the character array it did nothing. Also, missing values for date are '.', but the properties defines it as a text value. I suspect dates behave differently than normal character / numerical variables, but I haven't learned yet in what way.

Pleased with my progress thus far. I'm nearly to the point of having a program that will make a new dataset download from ebay immediately analysis friendly.

No comments:

Post a Comment