TartaglIA: Exploitation and Extension of the OMOP Common Data Model to Drive Artificial Intelligence and Precision Medicine
In Europe and Spain, large amounts of health data are stored, but they are often scattered across isolated systems, which makes them difficult to use effectively in clinical research and in the development of artificial intelligence (AI) models. The main challenge is to build a collaborative and secure environment in which health data can be used jointly while remaining within the healthcare facilities, which thereby keep their autonomy. In the AI field, this takes the form of federated learning, in which several data sources collaborate on developing models without transferring data outside the centers, thus preserving data privacy. This approach has the potential to drive innovation in AI by providing more training data, to enhance companies' competitiveness, to promote the creation of specialized jobs, and to support compliance with regulations such as the GDPR (General Data Protection Regulation).
What is TartaglIA?
TartaglIA is a federated artificial intelligence network designed to enhance clinical and healthcare research in Spain. The initiative is led by GMV and is part of the R&D Missions in Artificial Intelligence program within the Spain Digital 2025 agenda and the National Artificial Intelligence Strategy, funded by the European Union through the Next Generation EU funds. Alongside GMV, the consortium comprises 15 other public and private entities, including Veratech, and aims to accelerate the use of AI in various medical fields.
One of its main goals is to optimize the training of mathematical models that support clinical decision-making, in order to promote personalized and precision medicine.
Among its most notable applications, TartaglIA focuses on diagnostics in four key areas: Alzheimer's, prostate cancer, diabetes, and complex chronic diseases. It uses advanced AI techniques to guide the acquisition of diagnostic-quality ultrasound medical imaging, as well as to analyze large volumes of clinical data, both structured and unstructured. The federated network allows data-owning entities to collaborate without compromising the security and privacy of their data. Each center has a computing node within the Federated Learning Network. It is the Federated Network that moves the learning models to these nodes where the data is stored, not the other way around, following the procedure shown in the following image:
What is Veratech's role in the TartaglIA project?
For artificial intelligence models to be trained effectively and efficiently, the data at each node must be harmonized or standardized according to a common format. This ensures that, even though the data is located in different places, it can be accessed and interpreted consistently and uniformly by the AI models both during the training phase and in clinical use.
If the data at each node does not conform to a common data model, the training process becomes significantly more complicated, as it has to work with various formats, and even more challenging, with different semantics. This can lead to incompatibilities or inconsistencies when analyzing the different datasets, resulting in less accurate or useful outcomes. Therefore, harmonization is crucial for federated learning to function properly and to produce robust models.
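To make concrete why a shared format matters for federated training, the following is a minimal sketch of federated averaging on a toy linear model. It is not TartaglIA's actual training procedure: the `Node` class, the function names, the learning rate, and the toy data are all assumptions made for illustration. The point it shows is that every node must expose rows with the same structure and meaning, so that the same update code can run everywhere and only model parameters travel.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical hospital node; `rows` never leave this object."""
    name: str
    rows: list  # (feature, label) pairs in the shared, harmonized format

    def local_update(self, w, b, lr=0.01, epochs=20):
        """Run a few gradient-descent steps on local data only."""
        n = len(self.rows)
        for _ in range(epochs):
            grad_w = 2 / n * sum((w * x + b - y) * x for x, y in self.rows)
            grad_b = 2 / n * sum((w * x + b - y) for x, y in self.rows)
            w -= lr * grad_w
            b -= lr * grad_b
        # Only parameters and a sample count are returned, never the rows.
        return w, b, n


def federated_round(nodes, w, b):
    """Send the current model to every node and average what comes back."""
    updates = [node.local_update(w, b) for node in nodes]
    total = sum(n for _, _, n in updates)
    new_w = sum(wi * n for wi, _, n in updates) / total
    new_b = sum(bi * n for _, bi, n in updates) / total
    return new_w, new_b


def train(nodes, rounds=200):
    w, b = 0.0, 0.0
    for _ in range(rounds):
        w, b = federated_round(nodes, w, b)
    return w, b


# Toy data (made up): both centers follow y = 2x + 1 in the same format.
nodes = [
    Node("center_a", [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0)]),
    Node("center_b", [(x, 2.0 * x + 1.0) for x in (3.0, 4.0)]),
]
w, b = train(nodes)
print(f"w={w:.4f}, b={b:.4f}")
```

With this toy data the averaged parameters should approach the slope 2 and intercept 1 used to generate it. If one center stored its feature in a different unit or under a different meaning, the same `local_update` code would silently learn something else, which is the inconsistency that harmonization is meant to prevent.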
Within the TartaglIA project, Veratech leads the work package focused on researching methods and techniques that facilitate the creation of standardized data repositories, based on the OMOP Common Data Model (OMOP CDM), from heterogeneous clinical data. Research has been conducted on methods for converting data represented in other Electronic Health Record standards, such as OpenEHR, to OMOP. Furthermore, to allow unstructured data (images) to be loaded into the OMOP CDM, a new extension of the OMOP CDM has been researched in collaboration with FISABIO to store different imaging modalities and their metadata. This extension adds two new image tables to the OMOP common data model. The work has also supported the study and application of health standards such as DICOM and MIDS as intermediate steps for the subsequent loading of these images and their metadata into the extended OMOP CDM.
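As a rough illustration of what loading image metadata into an extended data model can involve, the sketch below maps a DICOM-style header to one row of an image table. The article does not give the names or columns of the two new tables, so the table `image_occurrence_sketch` and all of its columns are hypothetical, as are the header values. Only the DICOM attribute keywords (`StudyDate`, `Modality`, `BodyPartExamined`, `StudyInstanceUID`, `SeriesInstanceUID`) and the `YYYYMMDD` date format come from the DICOM standard.

```python
import sqlite3
from datetime import datetime

# Hypothetical table: not the real schema of the OMOP CDM image extension.
DDL = """
CREATE TABLE image_occurrence_sketch (
    image_occurrence_id    INTEGER PRIMARY KEY,
    person_id              INTEGER NOT NULL,
    image_date             TEXT NOT NULL,
    modality_source_value  TEXT,
    body_site_source_value TEXT,
    study_uid              TEXT,
    series_uid             TEXT
)
"""


def dicom_header_to_row(image_id, person_id, header):
    """Turn a dict of DICOM attributes into a tuple matching the table above."""
    # DICOM stores dates as YYYYMMDD; convert to ISO 8601 for the database.
    study_date = datetime.strptime(header["StudyDate"], "%Y%m%d").date().isoformat()
    return (
        image_id,
        person_id,
        study_date,
        header.get("Modality"),
        header.get("BodyPartExamined"),
        header.get("StudyInstanceUID"),
        header.get("SeriesInstanceUID"),
    )


# Made-up header values for illustration only.
header = {
    "StudyDate": "20230115",
    "Modality": "MR",
    "BodyPartExamined": "PROSTATE",
    "StudyInstanceUID": "1.2.3.4",
    "SeriesInstanceUID": "1.2.3.4.1",
}

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
conn.execute(
    "INSERT INTO image_occurrence_sketch VALUES (?, ?, ?, ?, ?, ?, ?)",
    dicom_header_to_row(1, 1001, header),
)
rows = conn.execute(
    "SELECT image_date, modality_source_value "
    "FROM image_occurrence_sketch WHERE person_id = ?",
    (1001,),
).fetchall()
print(rows)
```

In a real pipeline the header would be read from the image files with a DICOM library, and source values such as the modality would be mapped to standard concepts rather than stored only as text.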
All of this research has been put into practice in the work packages devoted to harmonizing clinical data of interest from patients with prostate cancer, Alzheimer's disease, or diabetic retinopathy. Metadata has also been harmonized for prostate magnetic resonance images, pathology images of prostate biopsies, brain magnetic resonance images, and retinal fundus images.
In these work packages, Veratech has harmonized the various data sources, which originate from the repositories of the data-providing partners in each case, to the common data model (OMOP CDM). As mentioned, the data sources are heterogeneous and differ in structure, granularity, terminology, and semantics. Harmonizing clinical data and images is essential for the subsequent training of models for each use case. First, preliminary research was needed to create normalization guidelines that standardize the clinical variables of interest, their units, and their allowable ranges. Most of the work, however, has consisted of building complex ETL (Extract, Transform, Load) processes, especially the data transformation step, and of understanding and applying the OMOP CDM image extension for the first time. The data was then loaded locally into the OMOP CDM at each of the data-providing centers for the different work packages. Finally, the queries that retrieve the variables and metadata of interest were specified according to the AI model to be trained. The results of these queries serve as input to the training of the AI models.
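The steps described above (normalize units, check allowable ranges, load locally, query for training input) can be sketched in a few lines. This is an illustrative toy, not Veratech's ETL: the source rows, the allowable range, and the helper names are assumptions, and the table is a heavily reduced version of the OMOP CDM `measurement` table. The concept identifier is left at 0, the OMOP convention for an unmapped value, rather than guessing a real concept.

```python
import sqlite3

# Conversion factors to ng/mL (1 ug/L equals 1 ng/mL; 1 ng/dL equals 0.01 ng/mL).
TO_NG_PER_ML = {"ng/mL": 1.0, "ug/L": 1.0, "ng/dL": 0.01}
ALLOWED_RANGE = (0.0, 5000.0)  # hypothetical plausibility range for PSA


def normalize_psa(value, unit):
    """Return the value in ng/mL, or None if the unit or range is not accepted."""
    factor = TO_NG_PER_ML.get(unit)
    if factor is None:
        return None
    converted = value * factor
    low, high = ALLOWED_RANGE
    return converted if low <= converted <= high else None


# Extract: made-up rows as they might arrive from a source system.
source_rows = [
    {"patient": 1, "date": "2023-01-10", "psa": 4.2, "unit": "ng/mL"},
    {"patient": 2, "date": "2023-02-03", "psa": 650.0, "unit": "ng/dL"},
    {"patient": 3, "date": "2023-02-20", "psa": -1.0, "unit": "ug/L"},  # out of range
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE measurement (
        person_id INTEGER, measurement_concept_id INTEGER,
        measurement_date TEXT, value_as_number REAL,
        unit_source_value TEXT, measurement_source_value TEXT)"""
)

# Transform and load: keep only rows that pass normalization.
for row in source_rows:
    value = normalize_psa(row["psa"], row["unit"])
    if value is not None:
        conn.execute(
            "INSERT INTO measurement VALUES (?, ?, ?, ?, ?, ?)",
            (row["patient"], 0, row["date"], value, "ng/mL", "PSA"),
        )

# Query: the variables a model would receive as training input.
training_input = conn.execute(
    "SELECT person_id, value_as_number FROM measurement "
    "WHERE measurement_source_value = 'PSA' ORDER BY person_id"
).fetchall()
print(training_input)
```

Because every center loads into the same table with the same units, the final query can be written once and run unchanged at each node.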
The scheme followed in TartaglIA is shown below:
Thus, the TartaglIA project has represented a significant advancement in clinical research through the use of health data standards and artificial intelligence techniques in a federated environment. The creation of decision support tools is crucial for improving the quality and safety of healthcare. Furthermore, TartaglIA has fostered collaboration among various national institutions of different types, driving innovation in health.
At Veratech, we are pleased to have collaborated on this innovative project, contributing our knowledge and experience in researching the data sources that organizations already hold and harmonizing them to the OMOP CDM extended with medical imaging data. This work adds another success story for the OMOP CDM and for the federated artificial intelligence network that it enables.
The project Federated Artificial Intelligence Network to Accelerate Health Research (TSI-100205-2021-17) has been funded by the Ministry for Digital Transformation and Civil Service through the R&D Missions in Artificial Intelligence 2021 program, within the framework of the Spain Digital 2025 agenda and the National Artificial Intelligence Strategy, with European funding through the Recovery, Transformation and Resilience Plan.