How to implement OMOP CDM and not die trying

Real-world data provides an ideal foundation for conducting clinical research studies. However, selecting and preparing these data entails significant effort from both researchers and the organization's IT staff. Large volumes of quality data are also the basis of any AI use within the organization. It is estimated that 80% of researchers' time and 50% of IT staff's time is spent on data selection and preparation. Using standards can minimize this effort, giving researchers tools to query the data and build cohorts without the intervention of IT personnel.

Methodology

One of the most widely used standardized models for real-world data research today is the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). In the European EHDEN project alone, there are currently 187 databases normalized to OMOP CDM, 27 of them located in Spain and 4 in the Valencian Community.

The OMOP CDM standard provides a relational database model and a standard vocabulary model that allow data to be harmonized for easier reuse. The main effort required is transforming the data already recorded in existing EHR systems into the OMOP CDM standard. The steps needed to obtain a normalized database are as follows:

  1. Analyze the meaning of the source data. Candidate source tables containing patient demographics, episodes/visits, conditions, procedures, medications, and clinical notes should be identified and analyzed. In Spain, the CMBD (the minimum basic data set of hospital discharges) is a very good source for populating an OMOP CDM, although almost any database available in the organization can contribute information.
  2. Understand and map local vocabularies and terminologies to international standards. If the organization already uses standard terminologies (e.g., ICD, SNOMED CT, or LOINC), this step is simpler. It is also necessary to know when each terminology came into use (e.g., the transition from ICD-9 to ICD-10, or the date the organization adopted SNOMED CT). Where local vocabularies are used, correspondences between the different codes must be defined. This work has already been done for cases such as AEMPS drug codes.
  3. Design and implement the extraction, transformation, and loading (ETL) programs. Different technologies can be used for this development, but in all cases it is essential to document the ETL logic to ensure its maintainability and future scalability. It must also be decided whether the data will be loaded as a one-off process or whether incremental loads and updates will be supported; in the latter case, bear in mind that these processes are computationally intensive.
  4. Normalize and validate the data. Once the data have been transformed and loaded into OMOP CDM, their quality must be validated to ensure the transformation was correct and the data can be used safely for clinical research. If issues are detected, all previous steps must be reviewed, as the error may have occurred in any of them: a source table may have been misinterpreted, some codes may not have been properly mapped, or the implementation itself may be faulty.
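Steps 2 and 3 can be illustrated with a minimal sketch of one ETL transformation: turning a source demographic record into an OMOP CDM `person` row. The source field names (`id_paciente`, `sexo`, `fecha_nacimiento`) and the local gender codes are illustrative assumptions, not a real source schema; the target concept_ids (8507 = MALE, 8532 = FEMALE) are the standard OMOP vocabulary concepts for gender.

```python
# Sketch of one ETL step: source demographics -> OMOP CDM `person` row.
# The source schema and local codes below are hypothetical examples.

# Hypothetical mapping from local gender codes ("H"/"M", e.g. Spanish
# "Hombre"/"Mujer") to OMOP standard concept_ids.
LOCAL_GENDER_TO_CONCEPT = {"H": 8507, "M": 8532}

def to_person_row(source: dict) -> dict:
    """Transform one source demographic record into an OMOP person row."""
    birth = source["fecha_nacimiento"]  # assumed ISO date "YYYY-MM-DD"
    year, month, day = (int(p) for p in birth.split("-"))
    return {
        "person_id": source["id_paciente"],
        "gender_concept_id": LOCAL_GENDER_TO_CONCEPT.get(source["sexo"], 0),
        "year_of_birth": year,
        "month_of_birth": month,
        "day_of_birth": day,
        # Keep the original value for provenance, as OMOP CDM recommends.
        "gender_source_value": source["sexo"],
    }

row = to_person_row({"id_paciente": 1, "sexo": "M",
                     "fecha_nacimiento": "1980-05-17"})
print(row["gender_concept_id"])  # 8532
```

A real ETL would look the concept_ids up in the OMOP vocabulary tables rather than hard-coding them, and documenting exactly this kind of mapping logic is what step 3 refers to.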

Results

In the Valencian Community, we have collaborated with several organizations, applying this methodology for the construction and validation of OMOP CDM databases.

| OMOP CDM Database | Organization | Number of patients |
|---|---|---|
| HULAFE | Hospital Universitario La Fe | 2 274 159 |
| Marina Salud Denia | Hospital de Denia Marina Salud | 314 587 |
| ABUCASIS | INCLIVA | 4 014 819 |
| VID-CONSIGN | FISABIO | 1 964 588 |

OMOP CDM Normalized Databases in the Valencian Community

The main data for populating an OMOP CDM database usually come from relational databases and structured sources (XML or JSON). However, OMOP CDM can also be populated from other data sources such as free text and medical imaging. At Veratech, we have participated in several projects that address these domains. In the ChronicExtract project, an OMOP CDM database was populated with information on diabetic patients contained in narrative clinical notes; the project aims to develop a dashboard for diabetic patients in which the OMOP CDM database centralizes all clinical information. Since some relevant data appears exclusively in narrative clinical notes, natural language processing techniques were needed to find mentions of relevant clinical concepts, and the mentions found were then represented using OMOP CDM tables and vocabulary. Clinical imaging is another source of information for training predictive models. The OMOP CDM radiology extension allows observational data from EHRs to be linked with medical image metadata; Veratech has participated in the Tartaglia project, where this extension was used as the basis for training models with imaging and clinical variables.
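How an NLP-detected mention ends up in OMOP CDM can be sketched with the NOTE_NLP table, which stores term occurrences extracted from clinical notes. In this sketch a simple regular expression stands in for a real NLP pipeline, the `nlp_system` name is a placeholder, and 201826 is the OMOP standard concept_id for type 2 diabetes mellitus.

```python
import re
from datetime import date

# Sketch: represent NLP-detected mentions as OMOP CDM NOTE_NLP rows.
# The regex below is a stand-in for a real NLP pipeline.

def extract_mentions(note_id: int, text: str) -> list[dict]:
    """Find type 2 diabetes mentions and build NOTE_NLP rows for them."""
    rows = []
    for i, match in enumerate(re.finditer(r"diabetes tipo 2", text,
                                          re.IGNORECASE)):
        rows.append({
            "note_nlp_id": i + 1,
            "note_id": note_id,
            "lexical_variant": match.group(0),
            "note_nlp_concept_id": 201826,     # type 2 diabetes mellitus
            "nlp_system": "regex-demo",        # placeholder system name
            "nlp_date": date.today().isoformat(),
            "term_exists": "Y",                # no negation detection here
            "offset": str(match.start()),
        })
    return rows

nota = "Paciente con diabetes tipo 2 en tratamiento con metformina."
print(extract_mentions(101, nota)[0]["lexical_variant"])  # diabetes tipo 2
```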

Conclusions and Future Work

Normalization to OMOP CDM provides clear advantages for clinical research, such as giving the data well-defined semantics and improving their quality. The initial normalization effort is considerable, but once done the benefits are evident: each new clinical research study no longer requires dedicated time for data preparation and cleaning. OMOP CDM also has the ATLAS environment, which allows healthcare professionals to create patient cohorts by applying filters to the information stored in the database, without requiring intervention from IT personnel.
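The kind of cohort filter ATLAS builds can be sketched as a selection over the CONDITION_OCCURRENCE table: persons with a given standard condition concept recorded on or after an index date. The in-memory rows and the concept_ids below are illustrative test data, not a real database.

```python
from datetime import date

# Sketch of a cohort filter over OMOP CDM CONDITION_OCCURRENCE rows.
# The rows below are illustrative; 201826 is the standard concept for
# type 2 diabetes mellitus, the other concept_id is an arbitrary example.
CONDITION_OCCURRENCE = [
    {"person_id": 1, "condition_concept_id": 201826,
     "condition_start_date": date(2021, 3, 1)},
    {"person_id": 2, "condition_concept_id": 201826,
     "condition_start_date": date(2018, 7, 9)},
    {"person_id": 3, "condition_concept_id": 316866,
     "condition_start_date": date(2022, 1, 5)},
]

def cohort(concept_id: int, index_date: date) -> set[int]:
    """Persons with the condition recorded on or after the index date."""
    return {
        r["person_id"]
        for r in CONDITION_OCCURRENCE
        if r["condition_concept_id"] == concept_id
        and r["condition_start_date"] >= index_date
    }

print(sorted(cohort(201826, date(2020, 1, 1))))  # [1]
```

Because every OMOP CDM database shares these tables and standard concept_ids, the same cohort definition runs unchanged against any of them; this is what makes the shared queries of a federated network possible.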

Normalization to OMOP CDM is also an opportunity to extract existing knowledge from free text clinical documents and stored images. Processes can be implemented to analyze this data to extract or annotate clinically relevant concepts about patients' health.

Finally, if OMOP CDM expands to more hospitals and care centers, we will have a unique opportunity to create a federated research network on real-world data based on OMOP CDM in the Valencian Community. By sharing the same information base, multicenter clinical studies can be carried out, even sharing queries and parameter definitions for the construction of research cohorts. And this can be done not only at the regional level; it would also allow participation in national and international research with minimal effort spent on managing clinical data.

Authors

  • Beatriz Navarro Ventura, Data Scientist at Veratech for Health S.L. 
  • Diego Boscá Tomás, Ph.D. Semantic Interoperability Consultant at Veratech for Health S.L. 
  • David Moner Cano, Ph.D. Semantic Interoperability Consultant at Veratech for Health S.L.