The goal of this second annual CDS pitching day is to review progress on ongoing CDS projects, to prepare the next batch of projects, and to continue matching domain-science demand with data-science supply.
We gather data providers and data analysts around the common theme of data science. There will be reminder talks presenting the tools and platforms built by the CDS, review talks on ongoing projects, and pitching talks preparing new projects. The aim is to draw an overall picture of the demand for CDS tools and expertise in Saclay and to help PIs build their projects. More information is available in the call for contributions, in our previous project calls, and in the CDS proposal.
The event will take place in the main auditorium of the Digiteo Shannon building 660 (LRI). Information on getting to the venue is available here.
The future of nuclear energy is very uncertain and actively discussed. To support these discussions and allow informed choices, we need to simulate electro-nuclear scenarios with a high level of precision. I will present the process used to simulate such scenarios and describe the two key models needed: the irradiation model and the fuel creation model. For each of these two models, I will present the physics problem it has to solve, why standard numerical meta-models are not sufficient, and what we would need in order to continue the development of our studies.
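To make the role of a meta-model concrete, here is a minimal sketch of the general idea: replace an expensive simulation by a cheap surrogate fitted on a few evaluations. The function and sizes below are hypothetical toys, not the actual irradiation or fuel creation models.

```python
import numpy as np

# Toy stand-in for an expensive simulation (hypothetical function;
# the real irradiation and fuel creation models are far more complex).
def expensive_simulation(burnup):
    return np.exp(-0.5 * burnup) + 0.1 * burnup

# Sample the simulator at a few training points...
x_train = np.linspace(0.0, 4.0, 9)
y_train = expensive_simulation(x_train)

# ...and fit a cheap polynomial meta-model (surrogate) on them.
coeffs = np.polyfit(x_train, y_train, deg=3)
surrogate = np.poly1d(coeffs)

# The surrogate can then be evaluated thousands of times per scenario.
x_new = 1.7
error = abs(surrogate(x_new) - expensive_simulation(x_new))
print(f"surrogate error at x={x_new}: {error:.4f}")
```

The talk's point is precisely that such off-the-shelf surrogates break down for these physics problems, which is what motivates the request for new tools.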
We investigate the possibility of flexible and adaptive data acquisition settings for nuclear physics experiments (gamma-ray spectroscopy). The goal is to rapidly adapt the event-selection settings for up to a few hundred detectors, knowing that the configuration changes with the experimental setup and that the behaviour of the electronics changes with the experimental conditions: we need to implement a feedback loop by learning from experimental data. This feedback should at least update the parameters of the online calculations, and possibly the algorithms themselves.
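As a minimal illustration of such a feedback loop, the sketch below tracks a drifting noise baseline with an exponential moving average and keeps a per-channel selection threshold a fixed margin above it. All constants and the simulated drift are hypothetical, not taken from the actual experiment.

```python
import random

random.seed(0)

# Hypothetical per-channel trigger threshold, updated online from data:
# an exponential moving average (EMA) of the observed noise baseline keeps
# the event-selection threshold a fixed margin above the drifting baseline.
ALPHA = 0.05      # learning rate of the feedback loop
MARGIN = 5.0      # threshold margin above baseline, in arbitrary ADC counts

baseline = 0.0
for step in range(2000):
    # Simulated noise sample whose mean drifts with experimental conditions.
    drift = 10.0 + 0.002 * step
    sample = random.gauss(drift, 1.0)
    baseline += ALPHA * (sample - baseline)   # EMA update from data

threshold = baseline + MARGIN
print(f"adapted threshold: {threshold:.2f}")
```

Updating algorithm choices, rather than just parameters like this threshold, is the harder part the abstract alludes to.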
The Large Hadron Collider at CERN, where the Higgs boson was discovered, is poised for a major upgrade aimed at the possible discovery of new particles: super-symmetric particles, dark matter, or signs of extra dimensions of space. The increase in the yearly number of recorded proton collisions comes at the cost of a large increase in the complexity of the recorded events. Preliminary studies show that traditional algorithms suffer from a combinatorial explosion of CPU time.
To reach out to computer science specialists, a Tracking Machine Learning challenge (trackML) is being set up for 2018.
In planetary science, impact craters are used to date and characterize planetary surfaces and to study the geological history of planets. This is therefore an important task, which has traditionally been carried out by visual inspection of images. This talk will present a currently ongoing RAMP challenge whose goal is to predict the location and size of craters on Mars from a satellite image, and the pipeline that was set up to tackle this problem.
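Since predictions here are circles (location and size), scoring requires matching predicted craters to ground-truth ones. The sketch below shows one possible matching criterion; it is purely illustrative and not the official metric of the RAMP.

```python
import math

def match_craters(predicted, truth, max_dist_ratio=0.5, max_rad_ratio=0.5):
    """Greedily match predicted craters (x, y, r) to ground-truth ones.

    A prediction matches a true crater when the centre distance and the
    radius difference are both small relative to the true radius.
    (Illustrative criterion, not the official challenge metric.)
    """
    matched = 0
    remaining = list(truth)
    for (px, py, pr) in predicted:
        for i, (tx, ty, tr) in enumerate(remaining):
            dist = math.hypot(px - tx, py - ty)
            if dist <= max_dist_ratio * tr and abs(pr - tr) <= max_rad_ratio * tr:
                matched += 1
                del remaining[i]   # each true crater can be matched once
                break
    return matched

# Two true craters; one good prediction and one spurious detection.
truth = [(100, 100, 20), (300, 250, 8)]
predicted = [(103, 98, 22), (500, 500, 10)]
print(match_craters(predicted, truth))  # 1 of 2 true craters recovered
```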
Current and future generations of large-scale astronomical surveys will have to deal with an increasing number of crowded fields due to their sensitivity. In such fields, a high number of objects (mostly galaxies) are "blended" together, which poses a challenge for both photometry (measuring individual fluxes) and morphology (shape measurements), both heavily related to the main science goals. This RAMP would explore the use of deep learning techniques to tackle the deblending of galaxies (detection + segmentation) and, ideally, the measurement of their morphologies (regression).
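As a point of reference for what "detection + segmentation" means here, the sketch below is the crudest possible baseline: threshold the image and label connected components, one label per detected object. The image and sizes are toy assumptions; this baseline is exactly what fails on blended (touching) galaxies, which motivates the deep-learning approach.

```python
import numpy as np

def segment(image, threshold=0.5):
    """Baseline detection + segmentation: threshold the image, then label
    4-connected components (each component = one detected object)."""
    mask = image > threshold
    labels = np.zeros(image.shape, dtype=int)
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue
        current += 1
        stack = [start]               # flood-fill the new component
        labels[start] = current
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < image.shape[0] and 0 <= nx < image.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    stack.append((ny, nx))
    return labels, current

# Two well-separated toy "galaxies" on a small image.
img = np.zeros((8, 8))
img[1:3, 1:3] = 1.0
img[5:7, 5:8] = 1.0
labels, n_objects = segment(img)
print(n_objects)  # 2
```

When the two bright regions touch, this labeling merges them into a single object; a learned deblender is expected to separate them.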
Fake news and alternative facts are a recent phenomenon. No one deliberately and consciously desires false information, but paradoxically, people pervasively consume fake information. It is essential to restore people's confidence in reliable, fact-checking sources and to reduce media bias, whether perceived or real. We believe artificial intelligence technologies could be leveraged to combat fake news and partly automate fact checking. We dive into American politics and propose a starting point for fake news detection.
The study of turbulence using numerical tools such as Direct Numerical Simulations, Large Eddy Simulations, etc. leads to the analysis of large amounts of data to discover the underlying physical mechanisms. Estimating a turbulent flow from a few sensor measurements is yet another important problem in fluid mechanics, which also involves big data analysis. In this project, we analyze turbulent channel flow using a sufficiently large numerical dataset (around 9 TB) close to the wall region, with the aim of improving predictability and also understanding the underlying physics.
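The "estimation from a few sensors" problem can be sketched in its simplest linear form: learn a least-squares map from sensor readings to the full field on a set of snapshots, then reconstruct an unseen snapshot from its sensors alone. The synthetic data below (a low-dimensional snapshot subspace, as modal analyses of wall turbulence suggest) and all sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for channel-flow snapshots (hypothetical sizes):
# fields live in a low-dimensional subspace spanned by a few modes.
n_snapshots, n_field, n_modes = 200, 50, 3
basis = rng.standard_normal((n_modes, n_field))
fields = rng.standard_normal((n_snapshots, n_modes)) @ basis

# A few fixed sensor locations sample each snapshot.
sensor_idx = [3, 12, 25, 33, 47]
sensors = fields[:, sensor_idx]

# Learn a linear map sensors -> full field by least squares, then
# reconstruct an unseen snapshot from its sensor readings alone.
W, *_ = np.linalg.lstsq(sensors, fields, rcond=None)
new_field = rng.standard_normal(n_modes) @ basis
estimate = new_field[sensor_idx] @ W
error = np.linalg.norm(estimate - new_field) / np.linalg.norm(new_field)
print(f"relative reconstruction error: {error:.2e}")
```

Real wall turbulence is of course not confined to three linear modes; nonlinear estimators trained on the 9 TB dataset are where the project goes beyond this sketch.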
Atherosclerosis is an inflammatory disease of the arterial wall caused by the formation of an atheroma plaque in the vessel wall. Data analysis of single-point spectra and images led to the identification of spectral changes due to the enrichment of murine J774 macrophages with fatty acids. Data processing was performed in the Matlab environment, and we are looking to re-process this dataset in a Python environment.
The aim of (scientific) data ranking is to help users choose between alternative pieces of information, especially when they are faced with huge amounts of data. However, ranking scientific data is a difficult task: various alternative quality criteria can be defined to order data items, depending on the data origin or even on the way the data have been obtained. As a consequence, it is very difficult to determine which ranking method (or which ranking criteria) to use. We present here a family of solutions, known as rank aggregation techniques, able to compute a consensus ranking from a set of input rankings. We will present the results obtained in our current applications.
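To make "consensus ranking from a set of input rankings" concrete, here is one classical rank aggregation technique, the Borda count; the abstract does not say which techniques the speakers use, so this is only a representative example.

```python
from collections import defaultdict

def borda_consensus(rankings):
    """Aggregate input rankings into a consensus ranking via Borda count:
    an item at position p in a ranking of n items scores n - p points;
    items are ordered by total score (ties broken alphabetically)."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] += n - pos
    return sorted(scores, key=lambda item: (-scores[item], item))

# Three alternative quality criteria rank the same four data items.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["a", "c", "b", "d"],
]
print(borda_consensus(rankings))  # ['a', 'b', 'c', 'd']
```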
We are open to any new collaboration with domain scientists who need to make the most of alternative rankings.
During fertilization in mammals, the egg emits a series of calcium oscillations that are specific to each individual and whose frequency and amplitude are modulated by the culture medium. We have developed advanced microfluidic techniques to record, stimulate and analyze the calcium response during the first hours of in-vitro fertilization (IVF), and we have hundreds of individual recordings obtained with different compositions of the culture medium. The construction, as part of a collaborative project with the CDS, of a prediction tool based on algorithms and a mathematical formalism of the functioning of the egg would open new perspectives for developmental biology.
INRA/MaIAGE and INRA/BDR are joining their efforts to create efficient software for the early evaluation of embryo viability from time-lapse observation of embryos. Specifically, we are investigating the early development of bovine embryos, which can be observed in 2D+time light microscopy at different stages of development. We have a database of hundreds of expert-annotated embryos. More than one hundred qualitative and quantitative measures, as well as the original movies, are available. A preliminary exploratory statistical analysis has been conducted, and classification and regression trees revealed discriminant features with respect to viability. Considering a restricted number of these features, we aim at their automatic evaluation from the movies.
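The building block of the classification trees mentioned above is a one-feature threshold rule (a "decision stump"). The sketch below searches for the best such rule exhaustively; the feature names and values are invented toys, not the project's actual embryo measures.

```python
def best_stump(features, labels):
    """Exhaustively search one-feature threshold rules (decision stumps)
    and return the (feature index, threshold, accuracy) of the best one."""
    n_features = len(features[0])
    best = (0, 0.0, 0.0)
    for j in range(n_features):
        for t in sorted({row[j] for row in features}):
            preds = [1 if row[j] > t else 0 for row in features]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            acc = max(acc, 1 - acc)  # the rule may be used either way round
            if acc > best[2]:
                best = (j, t, acc)
    return best

# Toy data: hypothetical cell-division timing (hours) and a symmetry
# score per embryo; label 1 = viable.
X = [[24.1, 0.9], [25.0, 0.8], [30.2, 0.4], [31.5, 0.3]]
y = [1, 1, 0, 0]
print(best_stump(X, y))  # (0, 25.0, 1.0): split on division timing
```

A full tree stacks such stumps recursively; the hard part in this project is computing the discriminant features automatically from the movies.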
This talk will present the ongoing preparation of a RAMP aiming at distinguishing subjects with Autism Spectrum Disorder (ASD) from typical control subjects. This analysis will use the Autism Brain Imaging Data Exchange (ABIDE I & II) database and data from Robert Debre Hospital based on R-fMRI and anatomical MRI. We will particularly focus on presenting the problem, the typical pipeline used to address it, and the current status of this RAMP.
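A typical step in such R-fMRI pipelines is to turn each subject's regional time series into connectivity features: compute a correlation matrix and keep its upper triangle as the feature vector fed to a classifier. The sketch below assumes random data and illustrative sizes, not the actual ABIDE preprocessing.

```python
import numpy as np

rng = np.random.default_rng(0)

# One subject's (random stand-in) regional time series: sizes are
# illustrative, not those of the ABIDE derivatives.
n_regions, n_timepoints = 10, 120
timeseries = rng.standard_normal((n_regions, n_timepoints))

# Functional connectivity = correlation between regional time series;
# the strictly upper-triangular part removes the redundant symmetric
# half and the trivial diagonal.
connectivity = np.corrcoef(timeseries)
iu = np.triu_indices(n_regions, k=1)
features = connectivity[iu]
print(features.shape)  # (45,) = 10 * 9 / 2 edge weights
```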
This work is in collaboration with the Pasteur Institute (Neuroanatomy group of the Unit of Human Genetics and Cognitive Functions).
Recent methods for demographic history inference have achieved good results, circumventing the complexity of raw genomic data by summarizing them into handcrafted features called summary statistics. We developed a new approach based on deep learning that takes as input the variant sites found within a sample of individuals from the same population, and infers demographic descriptor values without relying on these predefined summary statistics. By letting our model choose how to handle raw data and learn its own way to embed them, we were able to outperform a method frequently used in population genetics for the inference of three out of seven demographic descriptor values of a scenario with a bottleneck and two expansions. This is still preliminary work, and we are hopeful that future developments will allow us to tackle a broader range of demographic scenarios and outperform previous methods by developing more flexible artificial neural network architectures.
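For readers unfamiliar with "summary statistics": a classic example is the site frequency spectrum (SFS), computed below from a toy genotype matrix. The proposed network consumes the raw matrix directly instead of such hand-crafted summaries. All sizes and the allele frequency are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy genotype matrix: rows = sampled haploid individuals, columns =
# variant sites, entry 1 where the individual carries the derived allele.
n_individuals, n_sites = 10, 200
genotypes = (rng.random((n_individuals, n_sites)) < 0.3).astype(int)

# The site frequency spectrum (SFS), a classic handcrafted summary
# statistic: sfs[k] counts the sites where exactly k individuals carry
# the derived allele. The deep model skips this step and embeds the raw
# `genotypes` matrix itself.
counts = genotypes.sum(axis=0)
sfs = np.bincount(counts, minlength=n_individuals + 1)
print(sfs.sum())  # every site falls in exactly one frequency class
```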
In this article, we demonstrate, through an experiment, an approach for proposing completions of a query while it is being written, by exploiting numerous types of autocompletion in a multi-service context. This experiment builds on a SPARQL editor to which we have added autocompletion mechanisms that support a continuously evolving ontology, here the collaborative knowledge base Wikidata.
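The simplest of the completion mechanisms one could plug into such an editor is plain prefix matching over the ontology's vocabulary, sketched below; the vocabulary entries are hypothetical stand-ins for Wikidata property labels, and the real editor combines several richer autocompletion types.

```python
def complete(prefix, vocabulary, limit=5):
    """Minimal prefix-based autocompletion over an ontology vocabulary.
    Illustrative only: real SPARQL editors combine several mechanisms
    (properties, classes, entity labels, query context)."""
    matches = sorted(term for term in vocabulary if term.startswith(prefix))
    return matches[:limit]

# Hypothetical property labels from an evolving ontology such as Wikidata's.
vocabulary = ["instance of", "inception", "influenced by", "author", "image"]
print(complete("in", vocabulary))  # ['inception', 'influenced by', 'instance of']
```

Because the ontology evolves continuously, the interesting engineering problem is refreshing this vocabulary from live services rather than the matching itself.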
In the setting of the CDS2 of University Paris Saclay, the Data IT platform is a prominent initiative for building a linked open data cloud dedicated to data science. To contribute to this challenging goal, we propose in this project to take part in enriching the datasets available on the Data IT platform, and to propose tools that improve the quality of the data and knowledge already available, as well as of those that will become accessible through the platform.