Recent Activity
data v_sort;
input v1$ v2$ V3$ ;
cards;
A B C
F K J
D E F
C K J
K L W
S V R
;
run;
How can I sort all variables?

One way to copy a dataset from let's say path1 to path2 is to use the libname and data statement as below
%let path1= something;
%let path2=something else;
%let fname=be_auto_prmjun2024;
libname source spde "&path1.";
libname dest spde "&path2";
data dest.&fname.;
set source.&fname;
run;
If we have many files, we could use a file list and CALL EXECUTE to repeat the above script.
What Unix command could we use to copy only the SPDE files whose names contain the string prm (which stands for premium) from many subfolders to other subfolders?
Please provide a suitable Unix command.

data employees;
input department $ employee $;
datalines;
Sales John
Sales Mary
HR Alice
HR Bob
IT Eve
IT Charlie
;
run;
proc sql;
select department,
catx(',', employee) as employees
from (
select distinct department,
employee
from employees
) group by employee,department;
quit;
How can I get a comma-separated list of employees for each department using PROC SQL?

Scalable R Clustering and Visualization with SAS Viya Workbench by Anand Phand
Summary
In this blog, we dive into the seamless integration of R programming within the SAS Viya Workbench, highlighting how users can write and execute R code directly in notebooks. This powerful feature allows data scientists to leverage the flexibility of R while benefiting from the scalability and collaboration capabilities of the SAS environment.
We illustrate this through a complete end-to-end use case using the popular penguins dataset. The journey begins with exploratory data analysis (EDA) to understand key patterns and distributions. We then handle missing values using imputation techniques, followed by feature scaling to prepare the data for clustering. Using the K-means algorithm, we uncover natural groupings within the dataset. Finally, we bring the analysis to life with visualizations using R libraries such as ggplot2 and plotly.
This use case highlights how R users can comfortably perform advanced analytics in the SAS Viya ecosystem, making it a versatile platform for modern data science workflows.
Introduction to SAS Viya Workbench
Hello R users! With SAS Viya Workbench, you can now choose your favorite IDE and start coding in R. Creating an R project is easy with Viya Workbench, as it takes minimal effort to spin up a session in the cloud in seconds. Depending on the project requirements, you can configure your server by selecting the number of cores, memory, and GPU support. In this blog, we will briefly go through the server setup and then perform a simple data analysis with the penguins dataset. We will perform clustering, an unsupervised technique, to group the observations by their features into 3 clusters corresponding to the penguin species, and use the plotly library to visualize the results.
The tutorial “SAS Tutorial | Getting Started with SAS Viya Workbench for Learners” describes a step-by-step procedure for starting a workbench instance. Once the resource has been created for you, click the options button to start a workbench instance.
After the status changes from stopped to running (shown in green), select your preferred IDE from the drop-down menu. For this blog, we will launch Jupyter Lab-Python and R.
And that is it! Your IDE will open with the default workspace folder as your working directory, and the Launcher tab will offer options to create an R or Python notebook.
Once your notebook is created, you can rename the file and start writing your code in R.
About the Data: Penguins Are the New Iris!
For many years, Iris (published in the Annals of Eugenics in 1936 under the title “The use of multiple measurements in taxonomic problems” by Sir Ronald A. Fisher) was the go-to dataset for students, researchers, and practitioners studying statistical machine learning algorithms. Originally, the data was used for discriminant analysis and classification problems. It later proved to be an ideal data source for understanding segmentation, decision trees, support vector machines, logistic regression, and more.
Later, in 2020, another such dataset was published in the open-source programming world through the R package “palmerpenguins”; the data was collected in 2007-2009 by researcher Dr. Kristen Gorman. It soon gained popularity across the community as another source of reliable data from a real-world study. In this blog, we will analyse the penguins data with visualization tools, perform clustering, and apply a dimension reduction technique, the t-SNE algorithm, whose output can be visualized in an interactive plot using the plotly library in R.
Working with R on SAS Viya Workbench
We will install all required libraries and load them into the session.
We will load the penguins data from the palmerpenguins library as df and check the top 5 rows to understand the data. The dataset contains data on 3 penguin species observed in Antarctica, with features such as flipper length, body mass, bill dimensions, and island of observation.
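The setup described above might look like the following minimal sketch. The package list is an assumption: ggplot2 and plotly are named in this post, while Rtsne is assumed here as the t-SNE implementation. Installation may also require a configured CRAN mirror.

```r
# Assumed package list; Rtsne is an assumption, the post does not name a t-SNE package
pkgs <- c("palmerpenguins", "ggplot2", "plotly", "Rtsne")

# Install only the packages that are not already available
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) install.packages(to_install)

library(palmerpenguins)

# Load the penguins data as a plain data frame named df and show the top 5 rows
df <- as.data.frame(penguins)
head(df, 5)
```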
Exploratory Data Analysis
We will perform basic analysis on the data to check for missing values, the distribution of the numeric variables, and the frequency of the target variable, species.
First, use the str() function to check the basic structure of the R object.
Get a count of each penguin species and summary statistics of all variables.
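A minimal sketch of these checks, assuming the palmerpenguins package is installed. The per-column missing-value count at the end is an extra convenience step, not something the post states it ran.

```r
library(palmerpenguins)

df <- as.data.frame(penguins)

# Basic structure: column names, types, and a preview of values
str(df)

# Frequency of the target variable
table(df$species)

# Summary statistics for every variable (NA counts are reported per column)
summary(df)

# Explicit count of missing values per column
na_counts <- colSums(is.na(df))
na_counts
```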
Missing Value Imputation
There are missing values present in the bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g and sex variables. Let’s write a utility function to handle missing values using mean imputation for numeric columns. Mean imputation is simple but can bias the dataset. More advanced methods like KNN or model-based imputation might be preferred depending on the analysis goals. For the sex variable we will simply remove the rows corresponding to the missing observations.
After handling missing values, the summary statistics show that no missing values remain, and the row count is reduced from 344 to 333.
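The utility function described above could be sketched as follows. The name impute_mean is hypothetical, and this is one possible implementation rather than the code used in the original notebook.

```r
library(palmerpenguins)

df <- as.data.frame(penguins)

# Hypothetical helper: replace NA in every numeric column with that column's mean
impute_mean <- function(data) {
  num_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  for (col in num_cols) {
    col_mean <- mean(data[[col]], na.rm = TRUE)
    data[[col]][is.na(data[[col]])] <- col_mean
  }
  data
}

# Mean-impute the numeric columns, then drop rows where sex is missing
df_clean <- impute_mean(df)
df_clean <- df_clean[!is.na(df_clean$sex), ]

nrow(df_clean)
colSums(is.na(df_clean))
```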
Feature Scaling and One-Hot Encoding
Next, we can perform feature scaling on numeric variables and one-hot encoding of the categorical variables.
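One way to do this in base R is sketched below. It assumes the four body measurements are the numeric features to scale and, to stay self-contained, simply drops incomplete rows instead of repeating the imputation step. The base levels follow the choice described in this section.

```r
library(palmerpenguins)

df_clean <- as.data.frame(penguins)
df_clean <- df_clean[complete.cases(df_clean), ]  # simplification: drop rows with any NA

# Standardize the numeric features to mean 0 and standard deviation 1
num_cols <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
scaled <- as.data.frame(scale(df_clean[, num_cols]))

# Set the base levels explicitly; model.matrix() then creates one indicator
# column for every remaining level (the intercept column is dropped)
df_clean$island <- relevel(df_clean$island, ref = "Torgersen")
df_clean$sex <- relevel(df_clean$sex, ref = "female")
dummies <- model.matrix(~ island + sex, data = df_clean)[, -1]

df_scaled <- cbind(scaled, dummies)
head(df_scaled)
```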
Each categorical variable is encoded without its base level, taking “Torgersen” island and “female” sex as the base levels.
Visualization for Distribution Analysis
Overlaying histograms of a numeric variable by a category variable is a powerful visualization technique for comparing distributions across distinct categories. This helps answer questions like:
Do the groups have different means or spreads?
Are the distributions skewed?
Are there overlaps between groups?
This is especially helpful for:
Clustering: Do groups naturally separate based on the variable?
Classification: Would this variable help a model distinguish categories?
If the histograms are well separated, that variable is likely to be a strong predictor.
Here we have plotted the distribution of the Bill Depth (mm) variable, which shows a clear difference between Gentoo and the other two species. The distributions for ‘Adelie’ and ‘Chinstrap’ overlap, with no strong separation between the two. However, other features in the data may have clearly different distributions for ‘Adelie’ and ‘Chinstrap’. You can try plotting histograms of the other feature variables and observe the differences.
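An overlaid histogram like the one described can be sketched with ggplot2 as follows. The bin count, transparency, and theme are illustrative choices, not settings taken from the original notebook.

```r
library(palmerpenguins)
library(ggplot2)

df <- as.data.frame(penguins)
df <- df[!is.na(df$bill_depth_mm), ]  # drop rows with no bill depth measurement

# position = "identity" overlays the histograms instead of stacking them;
# alpha makes the overlapping regions visible
p <- ggplot(df, aes(x = bill_depth_mm, fill = species)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
  labs(x = "Bill Depth (mm)", y = "Count", fill = "Species") +
  theme_minimal()

print(p)
```

Replacing bill_depth_mm with another numeric column gives the corresponding plot for that feature.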
Heatmap for clustering variables and observations
A heatmap is a graphical representation of data where individual values are represented by colour. When applied to numeric variables, it shows at a glance how similar the variables are to each other.
Heatmaps can show:
Clusters of variables with similar behavior (via hierarchical clustering)
Block patterns that might suggest latent structures in the data
In R, you can plot an intuitive heatmap of a numeric matrix that provides insights into the grouping of observations as well as variables. This is useful for selecting features for clustering.
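As a rough sketch, base R's heatmap() function draws such a plot and applies hierarchical clustering to both rows and columns by default. The choice of the four body measurements and of the colour palette is an assumption made for illustration.

```r
library(palmerpenguins)

df <- as.data.frame(penguins)
num_cols <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")

# Keep rows with complete measurements and standardize each column
m <- scale(as.matrix(df[complete.cases(df[, num_cols]), num_cols]))

# Rows (observations) and columns (variables) are reordered by their dendrograms;
# scale = "none" because the matrix is already standardized
hm <- heatmap(m,
              scale = "none",
              labRow = NA,
              col = hcl.colors(50, "Viridis"),
              margins = c(10, 2))
```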
Now we are ready to apply a clustering algorithm to the scaled data, using a few selected features that we identified during exploratory data analysis.
Apply K-Means Clustering
Using the kmeans() function, we will apply a clustering algorithm to our scaled data with the selected features only. Since we already know that the dataset contains 3 penguin species, we will generate 3 clusters from the features and then check whether the predicted clusters align with the actual species.
As the cross tabulation shows, the predicted clusters align with the known species labels, and the clustering algorithm has performed well with the selected features.
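A minimal sketch of this step is shown below. The post does not list the selected features, so using the four body measurements is an assumption, as are the seed and the nstart value.

```r
library(palmerpenguins)

df <- as.data.frame(penguins)
df <- df[complete.cases(df), ]  # simplification: drop rows with any NA

# Assumed feature selection: the four scaled body measurements
features <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
X <- scale(df[, features])

# K-means depends on random starting centers, so fix the seed and
# try several starts to get a stable solution
set.seed(42)
km <- kmeans(X, centers = 3, nstart = 25)

# Cross-tabulate the known species against the predicted clusters
table(Species = df$species, Cluster = km$cluster)
```

Cluster numbers are arbitrary labels, so the check is whether each species falls mostly into one column, not whether the numbers match any particular order.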
Visualization of clusters using t-SNE for dimensionality reduction
A t-SNE plot (short for t-distributed Stochastic Neighbor Embedding) is a powerful tool for data visualization, particularly for high-dimensional data. Although we have relatively few variables here, we can still try the algorithm and see whether our clusters can be visualized in a 3D plot.
t-SNE is particularly good at preserving local structure, meaning:
Points that are close together in high dimensions tend to remain close together in the 2D/3D plot.
This can reveal natural clusters, even if no clustering algorithm was applied.
We will provide the full feature set for dimensionality reduction. The t-SNE algorithm will reduce it to 3 variables, and the final dataset with the transformed features will be used for visualization.
Using the plotly library, we will visualize the clusters in a 3-dimensional plot.
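The two steps above could be sketched as follows. The Rtsne package is an assumption, since the post does not name the t-SNE implementation, and for brevity this sketch uses only the four numeric measurements, not the full encoded feature set. The perplexity value and the column names dim1, dim2, and dim3 are illustrative.

```r
library(palmerpenguins)
library(Rtsne)
library(plotly)

df <- as.data.frame(penguins)
df <- df[complete.cases(df), ]  # simplification: drop rows with any NA

features <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
X <- scale(df[, features])

# Cluster labels used to colour the points
set.seed(42)
km <- kmeans(X, centers = 3, nstart = 25)

# Reduce the features to 3 dimensions; check_duplicates = FALSE because
# identical rows would otherwise stop the run
tsne <- Rtsne(X, dims = 3, perplexity = 30, check_duplicates = FALSE)

emb <- data.frame(tsne$Y, cluster = factor(km$cluster), species = df$species)
names(emb)[1:3] <- c("dim1", "dim2", "dim3")

# Interactive 3D scatter plot; hovering shows the true species
fig <- plot_ly(emb, x = ~dim1, y = ~dim2, z = ~dim3,
               color = ~cluster, text = ~species,
               type = "scatter3d", mode = "markers",
               marker = list(size = 3))
fig
```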
Concluding Remarks
Exploring R programming within the SAS Viya Workbench opens powerful possibilities for data scientists and analysts. The ability to write, execute, and manage R code within a scalable, secure environment bridges the flexibility of open-source R with the enterprise strength of SAS. Through our end-to-end analysis of the penguins dataset, from exploratory data analysis and data cleaning to clustering and visualization, we've seen how easy it is to build complete analytical workflows. As more organizations embrace multi-language data science platforms, SAS Viya Workbench stands out as a versatile and collaborative space to unlock the full potential of R.

DATA A;
INPUT id name$ Height;
datalines;
1 A 1
3 B 1
5 C 2
7 D 2
9 E 2
;
run;
DATA B;
INPUT id name$ weight;
datalines;
2 A 2
4 B 3
5 C 4
7 D 5
;
run;
proc sort data=A;
by ID;
run;
proc sort data=B;
by ID;
run;
DATA xyz;
MERGE A B;
BY ID;
RUN;
