JMP Discovery/Summit '09 Day One
Wed: Things are going [phenomenally] great here at the 2009 JMP conference. Before going to Botanical Gardens, I visited with a few war heroes; actually, recipients of the Congressional Medal of Honor! (I’ll post my photo with three of them, including the current president of their organization.) There are 96 surviving recipients, and half of them are here for a convention coinciding with our JMP Discovery conference/training and JMP Summit. They also end Saturday evening and may receive a surprise visit.
Note to readers: These are just some outline reminders to myself; hope they help you also.
Visited John Sall, Diana Levey, Kathy Walker, and Arati Bechtel, briefly. John Sall introduced me for the table discussion (“those interested in Facebook”, he said), and then, at my prompting, “and LinkedIn.” Had a great discussion at our table and was pleased and grateful that JMP social-network leader, Arati Bechtel, visited with us.
You can see the entire JMP conference brochure at ::
You can read the papers submitted to Discovery at JMP.com/live
which takes you to :: http://www.jmp.com/about/events/chicago2009/presentations.shtml
NOTES on JMP Summit & Discovery, Thursday Sep 17, 2009
Welcome for Thurs morning from Jeff Perkinson, introducing John Sall.
Five keynotes, w/George Box this evening, and Malcolm Gladwell on Friday. There are twelve on the JMP Steering Committee. #JMPcon is the Twitter hashtag. See also JMP.com/ChicagoConference/ There are 250+ attendees, including JMP staff, here at the conference.
JMP, helping people: “Thrilling Thursday” of Discovery 2009
Introducing John Sall, looking forward to a great JMP conference. Usually John talks about the next release, and JMP 9 is a year or so away. JMP 8 was introduced recently, so he will discuss new features in JMP 8 that will open your eyes. Dealing with big models is one of the new features.
The way we make products better is by experimenting and learning. There are usually two levels of screening designs to learn efficiently. Biotech people could have 20 cases, two classes (cancer or not), but 2.5 million factors. This, then, is a ‘big model’. John Sall ran a demo (on stage) using the attendee database. Brad Jones (in the audience) developed the code. Brad is project mgr for JMP development.
Dr. Stu Hunter worked on Mr Rogers to pull the trains through the tunnel. Now, it is said, he is legendary in Stat. Five aces of spades: “how big of an effect do you want to be?” John Sall removes his blindfold to figure out who the five aces are. He runs Proc Stepwise. Identified Karen and the other aces. Demonstrates picking out the cases with large-effect factor coefficients.
It’s called ‘super-saturated design’. Brad Jones is the JMP guru on it.
Next, John Sall talked about ‘model prediction validation’ and started with screening design. Then he showed ‘hold back’ of 20% to be the validation test group and then quickly automatically executed 520 runs on the 80%.
The model fit improves as you add the big factors; then watch it decrease as the many other factors are added in. You know when to stop by comparing against the held-back data. (See the work of Burnham and Anderson.)
Next John Sall showed some Stepwise regression and model averaging. There were 2.8 million models, taking eight seconds. There is a 64-bit version release of JMP for the larger model sets. This will do wonders in marketing and data mining: so only people that want junk mail will receive junk mail. Yeah!
John Sall showed recursive partitioning and bootstrap sampling. He then showed a stage-wise procedure called gradient boosting. A learning rate is included at each stage for tuning. There were four tuning parameters for this design experiment, including how many ‘layers’ are used. ‘Gradient boosting’ in some instances got much better R**2 than ‘random forest’. Cross-validation can help avoid over-fitting. Tuning is needed to determine the number of factors to include, esp. in data mining.
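To make the stage-wise idea concrete, here is a minimal sketch of gradient boosting for squared error, using one-split ‘stumps’ as the model at each stage. It is not what JMP runs internally; the function names, the toy data, and the choice of stumps are all assumptions for illustration, and only two tuning parameters (number of stages and learning rate) are shown.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) for the best single split."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # splits that leave both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_stages, learning_rate):
    """Fit stumps stage by stage to the current residuals; return fitted values."""
    base = sum(ys) / len(ys)
    fitted = [base] * len(ys)
    for _ in range(n_stages):
        residuals = [y - f for y, f in zip(ys, fitted)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        # Only a fraction (the learning rate) of each stump is added.
        fitted = [f + learning_rate * (left_mean if x <= threshold else right_mean)
                  for x, f in zip(xs, fitted)]
    return fitted


def r_squared(ys, fitted):
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return 1.0 - ss_residual / ss_total


# Toy data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2, 5.1, 4.9]
r2_few = r_squared(ys, boost(xs, ys, n_stages=5, learning_rate=0.1))
r2_many = r_squared(ys, boost(xs, ys, n_stages=50, learning_rate=0.1))
print(r2_few, r2_many)
```

The training R**2 keeps rising as stages are added, which is exactly why a held-back set or cross-validation is needed to decide when to stop.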
We are having lots of fun and learning tons. Ref to Malcolm Baldwin.
Q/A: [Over your head, for sure, but super-fun to listen to -- Wowie!]
The SAS tool to use is Enterprise Miner; $1,000,000 Netflix competition.
On the stage was the laptop Dell running Windows and MS Powerpoint.
FYI, Team JMP is filming the main stage for an archive video. Can you find it?
“Fool’s Gold or the Mother Lode” Professor Dick DeVeaux
His first illustration, with graphs and jokes, was the price of oil, with [Jimmy Carter] gasoline lines. Next, predicting the weight of players on the U Texas football team.
What makes Data Mining different (from the few values on a curve)? There are 16 terabytes in the UPS tracking database. 1 PB every 72 minutes transmitted via Google. [PB = petabytes.]
TV show ‘ER’ was based on the Cook Co hospital where they have really interesting cases. Emergency room: “I am having chest pains” and staff asks questions in a triage decision tree. Data cleaning is required.
Professor DeVeaux showed neural nets in pictures, with inputs X1, X2, …, a hidden layer z1, z2, z3, and an output layer y. It is computationally intensive. Model building includes training, test, evaluate, and what he likes to call “shaking the tree” to pare down the variables to a reasonable number.
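As a reminder of what the picture means, here is a small sketch of a forward pass through a net with one hidden layer of three nodes. The tanh and logistic activations and every weight value are assumptions chosen for illustration; the talk did not give them.

```python
import math


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Inputs x -> hidden nodes z (tanh) -> output y (logistic, between 0 and 1)."""
    z = [math.tanh(sum(w * xi for w, xi in zip(weights, x)) + bias)
         for weights, bias in zip(hidden_weights, hidden_biases)]
    activation = sum(w * zi for w, zi in zip(output_weights, z)) + output_bias
    return 1.0 / (1.0 + math.exp(-activation))


# Hypothetical weights: two inputs, three hidden nodes, one output.
hidden_weights = [[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [1.0, -1.5, 0.5]
y = forward([1.0, 2.0], hidden_weights, hidden_biases, output_weights, 0.2)
print(y)
```

Training is the expensive part: it searches for the weights, while the forward pass itself is only a few sums and function calls.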
Bootstrap aggregation is called ‘bagging’, and the ‘random forest’ variation comes in. You can end up with all the trees looking the same. Combined with ‘boosting’, the two methodologies complement each other for better results. He uses neural-nets with bootstrap aggregation and boosting to see how his simpler models are doing.
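A rough sketch of bagging, for my own reference: draw bootstrap samples (rows sampled with replacement), fit a simple model to each, and average the predictions. The base learner here is a one-split stump at the sample median, which is my simplification and not something shown in the talk; the data and names are hypothetical too.

```python
import random
import statistics


def fit_median_stump(sample):
    """Base learner: split at the sample's median x, predict the mean y on each side."""
    threshold = statistics.median(x for x, _ in sample)
    left = [y for x, y in sample if x <= threshold]
    right = [y for x, y in sample if x > threshold]
    overall = statistics.mean(y for _, y in sample)
    left_mean = statistics.mean(left) if left else overall
    right_mean = statistics.mean(right) if right else overall
    return lambda x: left_mean if x <= threshold else right_mean


def bagged_predictor(data, n_models=25, seed=0):
    """Average the predictions of stumps fitted to bootstrap samples of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]  # sample rows with replacement
        models.append(fit_median_stump(bootstrap))
    return lambda x: statistics.mean(model(x) for model in models)


# Toy step-shaped data, invented for illustration.
data = [(x, 0.0 if x < 5 else 10.0) for x in range(10)]
predict = bagged_predictor(data)
print(predict(0), predict(9))
```

Because every bootstrap sample sees nearly the same data, the fitted models can come out looking alike; random forests add random variable selection on top of this to make the trees differ.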
Last example, HP backpack inkjet printer (did it go to the moon?) First analysis predicted that the problem-reports correlated with zip-codes and a map showed they were in the hot and high-elevation Rocky Mountain time zone. Ink was tested in the rain forest of WA, not for dusty desert. Hence, they learned to put in different ink.
Success in data mining comes from knowing what the problem is; getting the data right; data preparation (definitions, cleaning, feature creation, and transformations); EDM – exploratory data modeling (reduce dimensions); Secondary modeling stage (include graphics and data analysis); finally cycle back to the top.
Contact DeVeaux@Williams.edu and also http://www.twocrows.com
He also lists five textbooks he uses in his modeling/prediction classes.
Q/A: Manny Uy asks the first question (another professor.)
Q: Correlation vs Causality. How do you deal with bias?
A: We all deal with it, trying to be open but still restricting. Cross-validation cures some ills. [Example] Stu Hunter taught him park analysis.
LUNCH; Meet the Developers; Break-out Sessions for presentations.
Dr Diane Michelson introduced Dr Marie Gaudard
“Classification of Breast Cancer Cells” Dr Marie Gaudard, et al.
They are finishing a book Visual Six Sigma -- Making Data Analysis Lean
Emphasize the value of visualization in any modeling endeavor. A typical model consists of 10 to 40 nuclei. Larger cells may be malignant or benign.
Dr Gaudard showed some visualization and then moved to analysis. Limiting the number of columns at first allows looking at basic statistics univariate bar-plots, including one-way analysis. Then scatter plot matrix shows two variables at a time, then three with color having red meaning malignant.
She adds correlation ellipses over the scatter plot matrix. The point is that it is very important to look at your data. Statistical graphics lead to understanding. She shortens the plots of logistic fit and for the ROC curve (receiver operating characteristic) to get them both on-screen. The area under the ROC curve gives an indication of goodness of fit, trying to approach 1.0. Dr Gaudard chooses a few fits that are close to 0.98 and then shows the 3d scatterplot.
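For anyone wondering what the area under the ROC curve measures, here is a small sketch. It uses the equivalent pairwise definition: the fraction of (malignant, benign) pairs in which the malignant case gets the higher score, with ties counting half. The labels and scores below are made up for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# 1 = malignant, 0 = benign; scores are hypothetical predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

A value of 1.0 means every malignant case outranks every benign one; 0.5 is no better than a coin flip.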
The red triangle at the top of the partition report gives the user many options, including ROC Curve, Lift Curve, Leaf Report, K-Fold Cross Validation, and The Small Tree View.
They found the model with max area, max smoothness, and mean texture. Blue dots were nonmalignant and red dots cancerous.
Next graph showed the effect of model complexity, with test curve and model curve, showing what happens with over-fit.
You can define row states simply by highlighting and identifying the row. Other basic operations were demonstrated showing power and ease of JMP. It makes sense to classify a new obs as malignant only if [something] is more than 0.5 (50%). Surface p3d plot was shown of predicted values. You can reclassify by moving the 0.5 plane up or down. A mosaic plot was shown of logistic model. Next was stepwise logistic modeling. JMP 9 may be quite different, she says. She next runs a partition model -- splits into two classes and split into five levels. The graphic of the tree hierarchy comes up. At this stage she is not tuning the models, just observing.
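The effect of moving the 0.5 plane can be sketched in a few lines. I am assuming the quantity being compared to the cutoff is the model's predicted probability of malignancy; the probabilities below are invented.

```python
def classify(probabilities, cutoff=0.5):
    """Label a case malignant only when its predicted probability exceeds the cutoff."""
    return ["malignant" if p > cutoff else "benign" for p in probabilities]


probabilities = [0.10, 0.45, 0.55, 0.90]  # hypothetical predicted values
print(classify(probabilities))              # cutoff at 0.5
print(classify(probabilities, cutoff=0.4))  # lower plane: the 0.45 case flips
```

Lowering the cutoff catches more malignant cases at the price of more false alarms, which is the trade-off the ROC curve traces out.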
A neural net model uses three hidden nodes to come up with a diagnosis. Then K-Fold cross-validation is used. To distinguish goodness, she looks at the points that have been mis-characterized (misclassified.) One of the neural nets classifies all correctly. Assessment of final model uses mosaic plot and contingency table. The talk illustrated the concepts and was very well done and instructive.
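A quick sketch of how K-fold cross-validation carves up the rows: each row is held out exactly once, and the model is refit K times on the remaining rows. The interleaved assignment of rows to folds is my assumption; JMP may assign folds differently (for example, at random).

```python
def k_fold_indices(n_rows, k):
    """Yield (training_rows, held_out_rows) index lists for each of the k folds."""
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    for held_out in range(k):
        training = [row
                    for fold_number, fold in enumerate(folds)
                    if fold_number != held_out
                    for row in fold]
        yield training, folds[held_out]


splits = list(k_fold_indices(n_rows=10, k=5))
for training, held_out in splits:
    print(held_out, training)
```

The misclassification counts from the K held-out pieces are then combined to compare models.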
Summary: Training, Test, and Validation Sets.
We construct three independent analysis data sets:
_ > A training set (60%) used to develop models.
_ > A test set (20%)
_ > A validation set (20%)
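The 60/20/20 split above can be sketched like this. The function name, the fixed seed, and the use of row numbers in place of real data are my own assumptions.

```python
import random


def three_way_split(rows, train_fraction=0.6, test_fraction=0.2, seed=0):
    """Shuffle the rows, then cut them into training, test, and validation sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_test = round(len(shuffled) * test_fraction)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains (20% here)
    return training, test, validation


rows = list(range(100))
training, test, validation = three_way_split(rows)
print(len(training), len(test), len(validation))  # 60 20 20
```

Shuffling before cutting matters when the rows are stored in a meaningful order, such as by date or by diagnosis.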
The paper was excellent, showing graphs and analysis in JMP.
“Conjoint Marketing Experimental Design Considerations”,
_ _ _ Chris Nachtsheim and Rob Reul
First, we set the stage for the business case; second, show JMP operations. DCE is Discrete Choice Experiments. Designed experiments bring us to results. Branding and packaging is important, such as in women’s fashions.
Stated Choice Methods, by Louviere, Hensher, & Swait. Shows the technological frontier (curve) of 1/cost vs speed, “revealed choice.” Expand to get off the curve and go to what is possible. [Google: This book is a reference work dealing with the study and prediction of consumer choice behavior, concentrating on stated preference (SP) methods.]
Especially to pioneer a new market, to get off the curve you need to investigate what is possible.
He showed us four design experiments from this last year. A star chart shows the Relative Attribute Importance. In manufacturing, an experiment to test design can cost over $10,000 (each) and in his restaurant tests, a dollar each. The challenge is having respondents complete the survey without fatiguing and quitting. Their DCE think tank meets once a week to explore the frontier.
Rob turns the podium over to Chris. Some contract studies are global in nature, encompassing 15 nations (each). Used to work for General Mills and Pillsbury, and now a fast-food chain to do predictive modeling, etc. They have a million eMail addresses and no problem getting respondents.
Notation: C = choice sets per survey; P = profiles per choice set; A = attributes per profile (exp design factors); L = levels of attribute a. It comes to a linear model. In 1974, Daniel McFadden developed conditional logit analysis and won a Nobel Prize. [Equation on the screen.] Choice inconsistency comes when the respondent gets tired of giving true answers.
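I did not copy the equation from the screen, but the standard conditional logit form gives the probability of choosing profile i from a choice set as exp(x_i·b) divided by the sum of exp(x_j·b) over all profiles j in the set, where x·b is the linear model of the attributes. A small sketch, with made-up attribute codes and coefficients:

```python
import math


def choice_probabilities(profiles, coefficients):
    """Conditional logit: probability that each profile in one choice set is chosen."""
    utilities = [sum(b * x for b, x in zip(coefficients, profile))
                 for profile in profiles]
    largest = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(u - largest) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# One choice set with P = 2 profiles and A = 2 attributes (hypothetical coding).
profiles = [(1.0, 0.0), (0.0, 1.0)]
coefficients = (1.0, 0.0)
print(choice_probabilities(profiles, coefficients))
```

With two identical profiles the probabilities are 0.5 each, so a tired respondent answering at random looks like a model with all coefficients near zero.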
Showed considerations for Efficiency Loss Due to Fatigue.
Constructed a series of D-optimal designs (from JMP 8)
There is a trade off between C (choice sets) and P (number of profiles.) They learned to use two profiles per choice set, always!
Optimal designs using the proposed model with fatigue effects
- Optimal heteroscedastic design
- - use the quadrature scheme proposed by Gotwalt, Jones, and Steinberg (2009)
- Rank-order conjoint experiments
- - rank-order experiments add more info (Vermeulen, Goos, Vandebroek, 2008)
- - but they also increase fatigue!
- Mixed Logistics design.
_ Fatigue leads to choice inconsistency
_ Two profiles per
Q: 24 choices per person vs one choice each from 24 people?
Leading into segmentation, since all customers are not equal.
A: If you can do it with four, you can do it with one.
Q: text or picture?
A: Pictures help avoid fatigue.
A coming strategy will take fatigue into account.
John Sall’s Question: Are you missing an important thing?
Number of attribute changes in the trial set?
Viz: variety of apples. One could start to look like an orange.
Don’t change more than a few to allow interaction changes.
Ans: Yes, we control it. Some things are approximations.
It is not just fatigue, it is confusion.
Serve up things that make sense.
Last question had to do with ability to go back and change during survey.
They usually can, before they close out and finalize.
Rotating the blocks served up initially can help in the analysis.
“Design Experiments that Changed the World”
Bradley Jones, Director of JMP R/D, several important experiments.
First Experiment (picture of Galileo and Leaning Tower of Pisa) was the mass and acceleration experiment, possibly in mythology. Ten pound and one pound cannon balls, dropped at the same time. [Except wind resistance, theoretically and experimentally they hit the ground at the same time. Think of it as ten one-pound balls.]
Rationalism vs Empiricism
Aristotle vs Galileo. Theoretical vs experimental.
Scotsman Lind wanted to improve the lives of sailors.
He thought he would try something acidic, such as vinegar.
Group 5 given lemons, etc. w/vitamin C and prevented scurvy.
Clinical trials are now mandated to approve any medical protocol.
R. A. Fisher and agricultural field trials. Father of designed experiments.
Four principles from him: 1. Factorial Concept; 2. Randomization;
3. Blocking; 4. Replication.
He also contributed: factorial design, ANOVA, maximum likelihood and Fisher’s Information, and on and on and on, such as Latin Squares.
He increased corn grain yield from 1940 to 2000, from 30 to 130.
Hundreds of scientists are involved across America, all due to Fisher.
He published his book in 1925.
Plackett and Burman and proximity fuses.
George Box was working at ICI in England at the time.
“We should have lost”; but Hitler attacked Russia.
[Add also Northrop Black Widow anti-bomber plane to prevent night attacks.]
Anti-aircraft fire had to hit directly. Fuses allowed close proximity.
See their 1946 paper in Biometrika. (Non-regular orthogonal designs)
George Box started in chemistry. Mark Bailey works on JMP w/Bradley Jones.
“Men may come and men may go, but the yield of this process seems to always be 35%.” George Box was brought in to see what he could do. “We have already optimized this process”, plant manager said. “Which optimum do you mean?”, George Box answered. (He noted that the two identical plants had two different settings. How could they both be optimized?)
Add the axial points in every dimension to turn a two-level factorial experiment into one that can estimate all the quadratic effects. (Box-Wilson design)
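To picture the Box-Wilson (central composite) idea, here is a sketch that lists the runs in coded units: the two-level factorial corners, two axial points per factor, and a center point. The default axial distance is the usual rotatable choice, the fourth root of the number of factorial runs; that default and the function name are my assumptions, not details from the talk.

```python
from itertools import product


def central_composite_design(n_factors, alpha=None, n_center=1):
    """Return coded runs: factorial corners, axial points, then center points."""
    if alpha is None:
        alpha = (2 ** n_factors) ** 0.25  # rotatable axial distance
    corners = [tuple(float(level) for level in run)
               for run in product((-1, 1), repeat=n_factors)]
    axial = []
    for factor in range(n_factors):
        for sign in (-1.0, 1.0):
            point = [0.0] * n_factors
            point[factor] = sign * alpha  # move along one axis only
            axial.append(tuple(point))
    center = [tuple([0.0] * n_factors)] * n_center
    return corners + axial + center


design = central_composite_design(2)
for run in design:
    print(run)
```

With two factors this gives 4 corners, 4 axial points, and 1 center point; the axial and center runs supply the extra levels needed to fit the squared terms.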
Stan Young and penicillin yield, Eli Lilly Company. They couldn’t make enough. It was a simple experiment, 2-cubed experiment. He had them try in the center and trials ended at the opposite corner from where they originally thought.
Peter Goos and sticky polypropylene. The optimal design of blocked and split-plot experiments. He went against tradition even though they said, “this is wrong.”
And lastly: Chris Nachtsheim uses coordinate exchange. This has application in design of experiments.
CHALLENGES: 1. Cheap and abundant energy; 2. Accessible pure water; 3. Food for everyone; 4. Shelter for everyone.
Abundant Energy will assist the other three. [He suggests solar power.]
Dr. George E. P. Box (keynote tonight: Founding Father in Stat/Graph)
Author of Statistics for Experimenters, (signed and given to attendees)
Winner of multiple awards from statistical societies. Also a fellow of the Royal Society (harder than being knighted, FYI.) Box-Jenkins time-series analysis is just one of his many areas. He will talk about sequential design of experiments. <Applause, and then standing-ovation.> He sits to speak. [See http://en.wikipedia.org/wiki/George_Box ]
I’m reminded of the Sultan being brought a mule . . .
How can we find things out efficiently? Aristotle asked.
Fisher was a remarkable man; met him in 1943 in Cambridge. He was extremely kind. Didn’t have computational speed of computers. They also realized that the various factors may act differently, including three-factor interaction.
Explained experiments to improve delivery of canned food.
He showed a diagram of ‘hyper-cube’ to visualize the needed experiment. (Told stories of early developers, some amazingly creative and some, humorous.)
For a bicycle climbing a hill: seat up; dynamo; handlebars; etc. ended up being controlled mainly by just two variables: dynamo and gears.
He compliments JMP on the tremendous progress they have made in methodologies and software, the things that can now be done.
Q: Book is best read and I look at it every week. What of the magical value of p=0.05 and its origins?
A: It is one thing to look and investigate but there are others also. It’s like other things when you start somewhere; the best you may have.
Q: Wartime brought urgency and need for efficiency. Where has design of experiments taken hold; and where has it not taken hold, and why not?
A: Just call it “six-sigma” and teach it.
Q: Grateful for your inventions, but not the Box Plot. You confided you would like to invent your own plot -- how is that going?
A: John Tukey named it; and I don’t know why he did that. Sometimes I am irritated with him. I remember one time he came forward and thought he knew exactly what I was going to say; but he didn’t. “I’m going to take a vote: me or Tukey -- and it came out in his favor.”
Q: You met Dr Fisher. Can you remember him? Design after?
A: Big difference (in Agriculture) was the immediacy of results. He had to wait for the harvest season.
Q: Read your book. Your impression of statistics today -- data mining; predictive modeling; analytics. “Statisticians are cool”, says the New York Times. But we don’t hear about design of experiments. PhD students didn’t know what a split-plot design is.
A: In order to teach stat you only need a degree in math. So the math profs are statisticians that didn’t quite make it. People in the trenches of industry have learned. It is difficult to get some to understand. It is not about proven theory; but rather about being curious and finding out. JMP will assist doing that.
QA: Graphic All-ANOVA. Make it so you can see it; modify, scale the data. You have to see it and then arrive at the right conclusion.
Q: Besides yourself, who is the statistician influencing you?
A: Probably from the Army in 1943 trying to understand stat. Wrote away to London for a course in statistics. They said they didn’t have one. They sent instead a list of books and almost every book was written by Fisher. Another book was in education, another in medicine. Keep coming back to realize they were all students of Fisher. It wasn’t just theory; they were applying it in a sensible and practical manner.
The presentation related the statistical procedures that could be used.
This is the first of three days of JMP Discovery and JMP Summit ::
You can see the entire JMP conference brochure at
and live blogging with pictures at www.JMP.com/live