Wiki Agenda Contact Version française

Data-centric Languages and Systems

Overview

The data-centric languages and systems thematic aims at designing and developping programming langages as well as systems that seriously take into account massive data. The purpose is to build robust and efficient platforms on well founded theoretical grounds. Internet explosion and the ever growing importance of data in applications as well as the recent emergence of cloud computing, has given birth to a whirlwind of new data models (XML, JSON) and languages (XPath, XQuery, Pig, Jaql...). Whether they are developed under the banner of NoSQL (which stands for Not Only SQL), for BigData Analytics, for Cloud computing or as domain specific languages (DSL) embedded in a host language, most of them share a common subset of SQL and/or the ability to handle semistructured data. Such languages can greatly benefit from formal uniform foundations, and we argue that such foundations should account for novel features critical to various application domains. Also, most of those languages provide limited type checking, or ignore it altogether. We believe type checking is essential for many applications, with usage ranging from error detection to optimization.

Regarding the language topic, we are currently interested in the XML format, this led us to the design and implementation of CDuce an XML centric programming language .

The Datacert thematic is highly interested in the certification of data intensive systems with the Coq proof assistant. As data, in all forms, is increasingly large in volume, as a result of computers capturing more and more of the companies activity, the scientists' work and also as a result of better and more powerful automatic data gathering tools (e.g. space telescopes, focused crawlers, archived experimental data mandatory in some types of government-funded research programs). The availability and reliability of such large data volumes is a gold mine for companies, scientists and simply citizens. It is then of crucial importance to protect data integrity and reliability by means of robust methods and tools. Program verification and certification have been intensively studied in the last decades yielding very impressive results and highly reliable software (e.g., the Compcert project). However, and surprisingly, while the amount of data stored and managed by data engines has drastically increased, little attention has been devoted to ensure that such (complex) systems are indeed reliable. The aim of this research track is to certify data intensive systems such as relational database systems, XQuery and semi structured engines using formal tools such as SMT provers (e.g., Alt-ergo) embedded in the Why(3) platform as well as interactive theorem provers (e.g., Coq).

In general we are interested in the application of static analysis techniques (type theory, abstract interpretation...) and theorem proving to data-related problems (query optimisation, verification of integrity contraints). We also focus on adapting mainstream compilation techniques to data-centric language evaluators (just-in-time compilation of queries, evalutation of pattern-matching and path expression in the presence of XML indexes).

People

Véronique Benzaken, Évelyne Contejean, Stéfania Dumbrava (PhD), Hyeonseung Im (Post Doc), Kim Nguyen,

Internship proposals

Here

Software development

Datacert: a Coq library for the relational data model Coq Documentation [CoqDoc file]

Tree Automata Toolkit (TAToo): An XPath engine based on tree automata TaToo's development site

CDuce: an XML-centric functional programming language , part of major Linux distributions (Debian/Ubuntu, Fedora, Mandriva). cduce logo

Related Grants

Related Publications

Books and books chapters

International Journals

International Conferences

INRIA Saclay - Île-de-France              Université Paris-sud              CNRS              LRI