Introduction to Azure Data Science Virtual Machine

by Guest Blogger

Jan 22, 2018

This post was originally published on Microsoft.com by Brad Severston, Gopi Kumar, Paul Shealy, and C.J. Gronlund.

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. It is available on Windows Server and on Linux. The Windows edition of the DSVM is offered on Windows Server 2016 and Windows Server 2012. The Linux edition is offered on Ubuntu 16.04 LTS and on the OpenLogic 7.2 CentOS-based distribution.

This topic discusses what you can do with the Data Science VM, outlines some of the key scenarios for using the VM, itemizes the key features available on the Windows and Linux versions, and provides instructions on how to get started using them.

What can I do with the Data Science Virtual Machine? 

The goal of the Data Science Virtual Machine is to provide data professionals at all skill levels and roles with a friction-free data science environment. This VM saves you the considerable time you would otherwise spend rolling out a comparable environment on your own. Instead, you can start your data science project immediately in a newly created VM instance.

The Data Science VM is designed and configured to support a broad range of usage scenarios. You can scale your environment up or down as your project needs change. You can use your preferred language to program data science tasks. You can also install other tools and customize the system for your exact needs.

Key Scenarios 

This section suggests some key scenarios for which the Data Science VM can be deployed.

Preconfigured analytics desktop in the cloud

The Data Science VM provides a baseline configuration for data science teams looking to replace their local desktops with a managed cloud desktop. This baseline ensures that all the data scientists on a team have a consistent setup with which to verify experiments and promote collaboration. It also lowers costs by reducing the sysadmin burden and saving on the time needed to evaluate, install, and maintain the various software packages needed to do advanced analytics.

Data science training and education 

Enterprise trainers and educators who teach data science classes usually provide a virtual machine image to ensure that their students have a consistent setup and that the samples work predictably. The Data Science VM creates an on-demand environment with a consistent setup that eases the support and incompatibility challenges. The benefit is largest where these environments need to be built frequently, especially for shorter training classes.

On-demand elastic capacity for large-scale projects

Data science hackathons and competitions, or large-scale data modeling and exploration, require scaled-out hardware capacity, typically for a short duration. The Data Science VM can help replicate the data science environment quickly, on demand, on scaled-out servers, so that experiments requiring high-powered computing resources can be run.

Short-term experimentation and evaluation 

The Data Science VM can be used to evaluate or learn tools such as Microsoft ML Server, SQL Server, Visual Studio tools, Jupyter, deep learning and ML toolkits, and new tools popular in the community, all with minimal setup effort. Because the Data Science VM can be set up quickly, it also suits other short-term usage scenarios such as replicating published experiments, running demos, and following walkthroughs in online sessions or conference tutorials.

Deep learning 

The Data Science VM can be used for training models with deep learning algorithms on GPU (graphics processing unit) based hardware. Using the VM scaling capabilities of the Azure cloud, the DSVM lets you use GPU-based hardware in the cloud as needed. You can switch to a GPU-based VM when you are training large models or need high-speed computation, while keeping the same OS disk.

The Windows Server 2016 edition of the DSVM comes pre-installed with GPU drivers, frameworks, and GPU versions of the deep learning algorithms. On Linux, deep learning on GPU is enabled only on the Data Science Virtual Machine for Linux (Ubuntu) edition. You can also deploy the Ubuntu or Windows Server 2016 edition of the Data Science VM to a non-GPU Azure virtual machine, in which case all the deep learning frameworks fall back to CPU mode.

Earlier, we published a Deep Learning Toolkit for Windows Server 2012, but we now recommend using Windows Server 2016 for Windows-based deep learning workloads. The CentOS-based Linux edition of the DSVM contains only the CPU builds of some of the deep learning tools (Microsoft Cognitive Toolkit, TensorFlow, MXNet) and does not come pre-installed with the GPU drivers and frameworks.
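As a rough illustration of the GPU versus CPU distinction above, the following Python sketch guesses which mode a VM would run in by probing for the NVIDIA driver utility. It is not part of the DSVM itself. It assumes that `nvidia-smi` is on the PATH whenever the GPU drivers are installed (on Windows it may live in a directory that is not on the PATH), and the function name `detect_compute_mode` is hypothetical.

```python
import shutil
import subprocess


def detect_compute_mode(probe="nvidia-smi"):
    """Return "gpu" if the NVIDIA utility lists at least one GPU, else "cpu".

    Assumption: the probe executable is on PATH when GPU drivers are present.
    """
    path = shutil.which(probe)
    if path is None:
        # Utility not found: treat the machine as CPU-only.
        return "cpu"
    try:
        # "nvidia-smi -L" prints one line per GPU it can see.
        result = subprocess.run(
            [path, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return "cpu"
    has_gpu = result.returncode == 0 and "GPU" in result.stdout
    return "gpu" if has_gpu else "cpu"


if __name__ == "__main__":
    print("Compute mode:", detect_compute_mode())
```

A check like this only reports whether a driver and device are visible. Each deep learning framework still decides for itself whether to use the GPU.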

What's included in the Data Science VM?

The Data Science Virtual Machine has many popular data science and deep learning tools already installed and configured. It also includes tools that make it easy to work with various Azure data and analytics products. You can explore and build predictive models on large-scale data sets using the Microsoft ML Server (R, Python) or using SQL Server 2017. A host of other tools from the open source community and from Microsoft are also included, as well as sample code and notebooks. The following table itemizes and compares the main components included in the Windows and Linux editions of the Data Science Virtual Machine.

| Tool | Windows Edition | Linux Edition |
| --- | --- | --- |
| Microsoft R Open with popular packages pre-installed | Y | Y |
| Microsoft ML Server (R, Python) Developer Edition, which includes: RevoScaleR/revoscalepy parallel and distributed high-performance framework (R & Python); MicrosoftML, new state-of-the-art ML algorithms from Microsoft; R and Python operationalization | Y | Y |
| Microsoft Office Pro-Plus with shared activation: Excel, Word, and PowerPoint | Y | N |
| Anaconda Python 2.7, 3.5 with popular packages pre-installed | Y | Y |
| JuliaPro with popular packages for the Julia language pre-installed | Y | Y |
| Relational databases | SQL Server 2017 Developer Edition | PostgreSQL (CentOS only) |
| Database tools | SQL Server Management Studio; SQL Server Integration Services; bcp, sqlcmd; ODBC/JDBC drivers | SQuirreL SQL (querying tool); bcp, sqlcmd; ODBC/JDBC drivers |
| Scalable in-database analytics with SQL Server ML Services (R, Python) | Y | N |
| Jupyter Notebook Server with the following kernels: | Y | Y |
| * R | Y | Y |
| * Python 2.7 & 3.5 | Y | Y |
| * Julia | Y | Y |
| * PySpark | Y | Y |
| * Sparkmagic | N | Y (Ubuntu only) |
| * SparkR | N | Y |
| JupyterHub (multi-user notebook server) | N | Y |
| Development tools, IDEs, and code editors | | |
| * Visual Studio 2017 (Community Edition) with Git Plugin, Azure HDInsight (Hadoop), Data Lake, SQL Server Data Tools, Node.js, Python, and R Tools for Visual Studio (RTVS) | Y | N |
| * Visual Studio Code | Y | Y |
| * RStudio Desktop | Y | Y |
| * RStudio Server | N | Y |
| * PyCharm | N | Y |
| * Atom | N | Y |
| * Juno (Julia IDE) | Y | Y |
| * Vim and Emacs | Y | Y |
| * Git and GitBash | Y | Y |
| * OpenJDK | Y | Y |
| * .NET Framework | Y | N |
| Power BI Desktop | Y | N |
| SDKs to access Azure and Cortana Intelligence Suite of services | Y | Y |
| Data movement and management tools | | |
| * Azure Storage Explorer | Y | Y |
| * Azure CLI | Y | Y |
| * Azure PowerShell | Y | N |
| * AzCopy | Y | N |
| * AdlCopy (Azure Data Lake Storage) | Y | N |
| * DocDB Data Migration Tool | Y | N |
| * Microsoft Data Management Gateway: move data between on-premises and cloud | Y | N |
| * Unix/Linux command-line utilities | Y | Y |
| Apache Drill for data exploration | Y | Y |
| Machine learning tools | | |
| * Integration with Azure Machine Learning (R, Python) | Y | Y |
| * Xgboost | Y | Y |
| * Vowpal Wabbit | Y | Y |
| * Weka | Y | Y |
| * Rattle | Y | Y |
| * LightGBM | N | Y (Ubuntu only) |
| * H2O | N | Y (Ubuntu only) |
| GPU-based deep learning tools | Windows Server 2016 edition | Ubuntu edition |
| * Microsoft Cognitive Toolkit (formerly known as CNTK) | Y | Y |
| * TensorFlow | Y | Y |
| * MXNet | Y | Y |
| * Caffe & Caffe2 | N | Y |
| * Torch | N | Y |
| * Theano | N | Y |
| * Keras | N | Y |
| * NVIDIA DIGITS | N | Y |
| * CUDA, cuDNN, NVIDIA driver | Y | Y |
| Big data platform (dev/test only) | | |
| * Local Spark Standalone | N | Y |
| * Local Hadoop (HDFS, YARN) | N | Y |
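Once a VM is running, a short script can confirm which of the command-line tools from the table are actually reachable. The Python sketch below is one way to do that under stated assumptions: the executable names in `EXPECTED_COMMANDS` are guesses (the table lists products, not commands), and the helper names `locate_tools` and `missing_tools` are hypothetical.

```python
import shutil

# Assumption: these executable names are illustrative guesses for a few of
# the tools in the table; adjust them to match the edition you deployed.
EXPECTED_COMMANDS = {
    "Git": "git",
    "Azure CLI": "az",
    "Jupyter": "jupyter",
    "Julia": "julia",
    "Vim": "vim",
    "bcp": "bcp",
    "sqlcmd": "sqlcmd",
}


def locate_tools(commands=None, which=shutil.which):
    """Map each tool name to the path of its executable, or None if absent."""
    if commands is None:
        commands = EXPECTED_COMMANDS
    return {tool: which(cmd) for tool, cmd in commands.items()}


def missing_tools(located):
    """Return the sorted names of tools whose executable was not found."""
    return sorted(tool for tool, path in located.items() if path is None)


if __name__ == "__main__":
    located = locate_tools()
    for tool, path in sorted(located.items()):
        print(f"{tool:10} {path or 'not found'}")
    print("Missing:", ", ".join(missing_tools(located)) or "none")
```

A tool reported as missing is not necessarily uninstalled. Graphical applications and tools installed outside the PATH will not be found by this kind of lookup.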

Getting started with the Windows Data Science VM

  • Navigate to the page for the desired Windows DSVM edition (Windows Server 2016 or Windows Server 2012).
  • Click the GET IT NOW button.
  • Sign in to the VM from your remote desktop using the credentials you specified when you created the VM.
  • To discover and launch the tools available, click the Start menu.

Next steps 

For the Windows Data Science VM

For the Linux Data Science VM
