Statistics 506, Fall 2016

Programming environments for data management and analysis


Languages for data management and analysis

Any programming language can be used for data management and analysis, but this work is most often done in a domain-specific language (DSL) designed for the purpose. Some of the most popular DSLs for statistics, math, and data analysis are:

(SPSS could be included in this list, but it is primarily used as a GUI-driven application rather than as a programming language; the same is true of spreadsheet software like Excel.)

Some very specialized languages and tools can also be useful in data management and analysis. Examples include SQL for querying databases, Stan for Bayesian modeling, AWK for processing text files, cURL for web scraping, and LaTeX for typesetting.

General-purpose languages (GPLs) are also commonly used for data management and analysis. Here are a few GPL scripting languages and data analysis libraries:

Some systems and application languages can also be used for data management and analysis. Coupled with libraries that support data management and analysis procedures, these languages can yield very high-performance code, but they are somewhat harder to use well:

  • C

  • C++

  • Java

  • C#

  • Go

Finally, some languages (JavaScript, Visual Basic) are rarely used for data management and analysis but have other important roles, and you may encounter them at some point.

Some of the most important contributions to programming are tools and libraries, not languages. Examples include the Boost libraries for C++ and the Lint utility for identifying problems in C code. Some “frameworks” like Hadoop and Spark, as well as database management systems (DBMS), are not usually considered to be “programming languages”. Instead, they are a type of data processing infrastructure that can interact with many different programming languages.

Currently, much of the attention in data-oriented computing is focused on non-database tools for manipulating heterogeneous data, and on language support for out-of-core (larger-than-memory) and distributed data processing.
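To make the idea of out-of-core processing concrete, here is a minimal sketch in Python that computes the mean of one column while reading the data one row at a time, so only a running count and total are ever held in memory. The function name `streaming_mean`, the column name `weight`, and the tiny in-memory "file" are all made up for illustration; real out-of-core tools are far more elaborate.

```python
import csv
import io


def streaming_mean(lines, column):
    """Mean of a numeric column, reading one row at a time.

    Only the running count and total are kept in memory, so the input
    can be much larger than available RAM.
    """
    reader = csv.DictReader(lines)
    count = 0
    total = 0.0
    for row in reader:
        total += float(row[column])
        count += 1
    if count == 0:
        raise ValueError("no data rows")
    return total / count


# Stand-in for a large file on disk (hypothetical data). With a real
# file you would pass the object returned by open(path, newline="").
demo = io.StringIO("id,weight\n1,2.0\n2,4.0\n3,6.0\n")
mean_weight = streaming_mean(demo, "weight")
print(mean_weight)  # 4.0
```

The same pattern works for any summary that can be updated incrementally (counts, sums, minima, maxima). Summaries that need all of the data at once, such as an exact median, require a different approach.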

Some ways to organize languages

  • Interpreted versus compiled (any language can be either interpreted or compiled, but usually one form or the other is most common for a given language)

  • Dynamic versus static languages; strong versus weak typing and static versus dynamic typing (see here)

  • High-level versus low-level

  • Imperative (procedural) versus declarative versus functional languages (see here). Most data analysis languages are primarily imperative, but Stan and SAS are examples of (primarily) declarative languages for data analysis.

  • Structured programming languages, including object oriented languages

  • “Expressive” languages (see here)

  • Open source versus proprietary; community-driven versus closed development; standards-based versus implementation-based.
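As a rough illustration of the imperative, functional, and declarative styles listed above, the following sketch computes per-group totals three ways in Python. The data and function names are invented for the example, and the declarative version uses SQL through Python's built-in `sqlite3` module as a stand-in for a declarative language; it is one possible way to show the contrast, not a canonical one.

```python
import sqlite3
from functools import reduce

# Hypothetical toy data: (group, value) pairs.
records = [("a", 1.0), ("b", 10.0), ("a", 3.0), ("b", 20.0)]


def totals_imperative(rows):
    """Imperative: spell out each step and update state in a loop."""
    totals = {}
    for group, value in rows:
        if group not in totals:
            totals[group] = 0.0
        totals[group] += value
    return totals


def totals_functional(rows):
    """Functional: fold over the rows, building a new dict at each step."""
    def step(acc, row):
        group, value = row
        return {**acc, group: acc.get(group, 0.0) + value}
    return reduce(step, rows, {})


def totals_declarative(rows):
    """Declarative: describe the result in SQL; the engine decides how."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE t (grp TEXT, val REAL)")
        con.executemany("INSERT INTO t VALUES (?, ?)", rows)
        query = "SELECT grp, SUM(val) FROM t GROUP BY grp"
        return dict(con.execute(query).fetchall())
    finally:
        con.close()


results = [f(records) for f in
           (totals_imperative, totals_functional, totals_declarative)]
print(results)  # three copies of {'a': 4.0, 'b': 30.0}
```

All three return the same totals. What differs is how much of the "how" the programmer has to write down, which is the distinction this classification is meant to capture.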

Comments on performance

Some languages are generally faster than others, but there is also a lot of variation within a language depending on how the code is written. With care, many (but not all) tasks can be carried out in R or Python (or Stata, etc.) at nearly the same speed as in C. However, it is also possible, and in fact easy, to write very slow code in high-level languages.
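The within-language variation can be seen in a small, hedged example. Both functions below count how many query values occur in a reference collection and return the same answer, but one scans a list for every query while the other builds a set once and does hash lookups. The workload and function names are hypothetical, and the timings will differ from machine to machine, so no specific numbers are claimed here.

```python
import timeit

# Hypothetical workload: how many of the queries occur in the reference?
reference = list(range(0, 10000, 2))   # 5,000 even numbers
queries = list(range(1000))


def count_hits_slow(queries, reference):
    """Each `in` test scans the list, so cost grows with len(reference)."""
    return sum(1 for q in queries if q in reference)


def count_hits_fast(queries, reference):
    """Build a set once; each `in` test is then a hash lookup."""
    lookup = set(reference)
    return sum(1 for q in queries if q in lookup)


slow_result = count_hits_slow(queries, reference)
fast_result = count_hits_fast(queries, reference)

# Time three repetitions of each version.
t_slow = timeit.timeit(lambda: count_hits_slow(queries, reference), number=3)
t_fast = timeit.timeit(lambda: count_hits_fast(queries, reference), number=3)
print(f"list: {t_slow:.4f}s  set: {t_fast:.4f}s  "
      f"same answer: {slow_result == fast_result}")
```

On typical hardware the set-based version should be much faster, and the gap widens as the reference collection grows. The point is only that the choice of data structure, not the language, accounts for the difference.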

In addition to the speed at which a program runs, we should also consider the time it takes to develop, verify, and maintain the code. A program written in C may be five times faster than the corresponding Python program, but take several hours longer to develop, and when you come back to modify it a year later, it may take you longer to understand the code again.

What languages are currently widely used for data analysis?

  • Here is an analysis of the usage level of different statistical software packages. It emulates the TIOBE programming language rankings.

  • Here is another analysis of statistical software used in published papers involving health services research.