Microsoft recently announced new Azure Data Lake services, which includes new tools to make data analytics easier. The new feature of interest in this post is the U-SQL language, which adds some new and powerful features to your data environment. In this article by Michael Rys, we get to see some details of this new toolset.
If you analyze the characteristics of Big Data analytics, several requirements arise naturally for an easy to use, yet powerful language:
- Process any type of data. From analyzing BotNet attack patterns from security logs to extracting features from images and videos for machine learning, the language needs to enable you to work on any data.
- Use custom code easily to express your complex, often proprietary business algorithms. The example scenarios above may all require custom processing that is often not easily expressed in standard query languages, ranging from user defined functions, to custom input and output formats.
- Scale efficiently to any size of data without you focusing on scale-out topologies, plumbing code, or limitations of a specific distributed infrastructure.
How do existing Big Data languages stack up to these requirements?
SQL-based languages (such as Hive and others) provide you with a declarative approach that natively does the scaling, parallel execution, and optimizations for you. This makes them easy to use, familiar to a wide range of developers, and powerful for many standard types of analytics and warehousing. However, their extensibility model and support for non-structured data and files are often bolted on and harder to use. For example, even if you just want to quickly explore your data in a file or remote data source, you need to create catalog objects to schematize file data or remote sources before you can query them, which reduces your agility. And although SQL-based languages often have several extensibility points for custom formatters, user-defined functions, and aggregators, they are rather complex to build, integrate, and maintain, with varying degrees of consistency in the programming models.
Programming language-based approaches to process Big Data, for their part, provide an easy way to add your custom code. However, a programmer often has to explicitly code for scale and performance, often down to managing the execution topology and workflow such as the synchronization between the different execution stages or the scale-out architecture. This code can be difficult to write correctly, and optimized for performance. Some frameworks support declarative components such as language integrated queries or embedded SQL support. However, SQL may be integrated as strings and thus lacking tool support, the extensibility integration may be limited or – due to the procedural code that does not guard against side-effects – hard to optimize, and does not provide for reuse.