Paris-Saclay Center for Data Science kick-off meeting
The goal of this meeting is to officially launch the Paris-Saclay Center for Data Science. We gather data providers and data analysts around the common theme of data science. The three external keynote talks and the seven talks given by members of the CDS cover a wide spectrum of topics on both domain sciences and data science.
The event will take place in the main auditorium of the Linear Accelerator Laboratory, building 200 on the UPSud Orsay campus. Information on getting to LAL is available here.
Rapid technological developments in biology, in particular in DNA sequencing technologies, allow us to collect large amounts of molecular data about the genome of each individual, and open the possibility of predicting drug response or evaluating the risk of various diseases from one's molecular identity. In this talk I will discuss some regularization-based approaches we have developed to estimate complex, high-dimensional predictive models from relatively few samples, in particular in cancer prognosis and toxicogenetics.
(Mines ParisTech / Institut Curie)
Enhancing functional neuroimaging with meta-analytic approaches
Functional brain imaging offers a unique view of the brain's functional organization, which is broadly characterized by two features: the segregation of brain territories into functionally specialized regions, and the integration of these regions into networks of coherent activity. Among other observation modalities, magnetic resonance imaging yields a spatially resolved, yet noisy, view of this organization. In this talk, I will discuss how the use of multiple datasets and machine learning tools can enhance the inference procedures that are necessary to go from data to knowledge about the brain.
(INRIA / Neurospin)
Data science in planetary science
Remote sensing is the major technique for studying planetary environments in order to decipher the structure and evolution of solar system bodies. For a decade, spacecraft have acquired high-resolution spectra, high-resolution images, hyperspectral images, and multi-angular hyperspectral images. Both the processing of raw data into high-level science results and the visualization of the large amount of data require innovative tools. Here I review some aspects of data science projects in planetary science, focusing on multi-angular hyperspectral imaging (~500 wavelengths), digital terrain models built with stereoscopic techniques from high-resolution images (~0.5 m/pixel), and data visualization.
(GEOPS / UPSud)
Cosmology: from fundamental questions to computing challenges
The Big Bang cosmological model provides a powerful framework to describe the evolution of the Universe. Despite tremendous theoretical and observational progress in the field, profound mysteries such as the nature of dark matter and dark energy remain unsolved. After a brief introduction to cosmology, I will present an overview of some of the large projects in astrophysics and cosmology planned for the next decade. These projects cover a broad range of the electromagnetic spectrum, from optical surveys (LSST, eBOSS, EUCLID) to future Cosmic Microwave Background (CMB) missions (CORE2) and next-generation radio interferometers (SKA). I will highlight some of the computing challenges faced by these projects, focusing on the data management and processing case of the LSST (Large Synoptic Survey Telescope).
In this talk I will discuss work on learning to rank (LTR) for information retrieval, in which the goal is to automatically construct a model that ranks documents in response to a query. In traditional supervised machine learning approaches to the LTR problem, one selects a set of manually engineered ranking features and then learns the best way of combining them to obtain the most powerful ranking model that those features are capable of producing. In ongoing work on truly autonomous search engines, we are moving evaluation, learning and feature engineering to a weakly supervised paradigm, learning from the implicit feedback that naturally emerges as part of users' interactions with the search engine. I will discuss recent progress in each of these three dimensions: evaluation, learning and feature engineering.
Maarten de Rijke
(University of Amsterdam)
The digital transition: applications of machine learning to marketing, engineering sciences, and medicine
In every sector of human activity, the pervasiveness of sensors and the accumulation of digital information have raised novel intellectual challenges, dreams and fears. Recently, intensive research in the field of high-dimensional statistics, progress in the description and modeling of networks, and the second life of optimization theory have generated concepts and algorithms that make it possible to perform inference on complex data and to envision new forms of interaction between experts or scientists from different fields. A major tension when addressing such issues from the viewpoint of applications is the balance between customization and reproducibility and, in my opinion, these two criteria should drive future innovations in the field of machine learning. In the talk, I will illustrate these ideas by going through a few recent achievements arising from interdisciplinary projects in the fields of digital marketing, fluid mechanics, and ethomics.
Designing and learning features for music information retrieval
This talk discusses a mix of concepts, problems and techniques at the crossroads of signal processing, machine learning and music. I will start by introducing content-based music information retrieval (MIR) as an important and challenging data science problem. Then, I will discuss recent work done at my lab on a variety of MIR problems such as automatic chord recognition, music structure analysis, cover song identification and instrument recognition. Along the way, I'll review the impact of feature design for specific MIR tasks, suggest that existing feature extraction methods in audio can be re-conceptualized as deep, multi-layer and trainable systems combining affine transforms and subsampling operations, and show a few examples where deep learning matches or outperforms the current state of the art in music and sound classification. Finally, I'll discuss open challenges and opportunities in the field.
Juan Pablo Bello
(New York University / Telecom ParisTech)
Direct-touch interaction for scientific visualization
Since the size and complexity of scientific datasets are growing at a very high rate, people are working on developing techniques to effectively depict and visualize them. Frequently, however, it is not sufficient to produce a single static visualization; instead, we have to support scientists in discovering aspects of the data that they did not know about. That means that we have to develop effective interactive visualization tools that support scientists in exploring their data.
In my talk I will address the problem of interactively visualizing data that has an inherent mapping to the 3D spatial domain, such as MRI scans, physical simulations, or molecular models. Specifically, I use interfaces on large, touch-sensitive displays because they tend to give people the feeling of "being in control of their data." That means we face the problem of providing input on a two-dimensional surface, which needs to be mapped to manipulations of the three-dimensional data space. I will talk about FI3D, a technique to navigate in 3D datasets and control 7 degrees of freedom with only one or two fingers being used simultaneously. Next, I will discuss the problem of spatial data selection, which is fundamental to further data analysis and also requires defining a 3D selection space using only input on a 2D plane. Finally, I will discuss a case study in which we integrated several different interaction techniques into a tool for fluid mechanics experts to explore their data. I will end my talk by pointing out some open problems and research challenges that we are currently facing.
The data science challenges of particle physics
Particle physics poses several unique challenges for data science, with multi-petabyte datasets, complex particle detectors, and the search for exceedingly rare signals in the data. The field is characterized by large international collaborations, which require a high level of coordination. I will give an overview of our data science challenges and discuss the statistical aspects of the recent discovery of the Higgs boson, including the collaborative statistical modeling techniques that are transforming the field. I will identify places where our tools and techniques are quickly evolving or beginning to fail, as well as opportunities for fruitful collaboration with the nascent field of data science.
(New York University)
Challenges for data science initiatives – an innovation management perspective
Center for Data Science (CDS) initiatives seem to be popping up all around the globe at the moment. Considering the data deluge phenomenon, the motivation behind such initiatives may seem trivial. However, a closer look reveals that the purpose, success conditions and managerial principles for CDS initiatives are much less clear. CDSs are neither private companies nor traditional research entities. What would be a suitable organizational model and philosophy, designed to avoid the pitfalls that other science-based movements have faced in the past?
Beyond the seemingly trivial purpose of being (analytical) service providers, each such initiative needs to build its own strategy for survival, success and long-term impact. It also needs to accomplish this feat in a way that differentiates it from other initiatives. To this end, the body of knowledge produced by management science in the form of methods, organizational models and best practices can be helpful. This talk will focus on some potential pitfalls and the potential contribution of design theory and innovation management methods for CDS initiatives.