Title: Natural Language Processing ETL Pipeline for OMOP Data Generation from Free-Text Clinical Case Reports

Authors: Thomas Rowlands, Esmond Urwin, Phil Quinlan, Nona Naderi, Anais Mottaz, Patrick Ruch, Basel Alshaikhdeeb, Venkata Satagopam and Tim Beck

Conference: IEEE CBMS 2026

Tags: Case Report, Named Entity Recognition, Natural Language Processing and OMOP CDM

Abstract: Synthetic data represented in the OMOP common data model (CDM) offers a safe foundation for research because it preserves the structure and clinical relationships of real-world datasets without exposing identifiable patient information. Synthetic data enables reproducible research: datasets can be shared and used for benchmarking algorithms, and analytics solutions can be prototyped before deployment with real data. However, synthetic data has limitations, such as a lack of clinical complexity, including atypical care pathways and rare comorbidities. Published case reports provide detailed descriptions of anonymised patient presentations, diagnoses, and outcomes, addressing the limitations that arise from using synthetic data alone. This study demonstrates the generation of OMOP data from published clinical case reports using Natural Language Processing (NLP) within an Extract, Transform, Load (ETL) pipeline. We created a novel case report corpus, optimised for text processing, consisting of 118,653 open-access case report articles and associated supplementary files. The MedCAT clinical NLP framework is used to extract medical concepts from free text in the case reports and link them to standard clinical terminologies. Using a purpose-built Python ETL package to map clinical terms to OMOP concepts and populate the clinical tables of the CDM, we generated an OMOP dataset of 83,290 patients with 1,490,318 condition records, 670,551 procedure records and 384,061 drug exposure records. Since the OMOP dataset is based on anonymised patients, it can be reused without restrictions. Future work will involve evaluating MedCAT model performance metrics on case report data and improving the accuracy of extracting clinical data and temporal events from case reports.
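The transform and load steps described in the abstract (mapping extracted clinical terms to OMOP concepts and populating the condition, procedure and drug tables) can be illustrated with a minimal Python sketch. This is not the authors' ETL package: the `ExtractedEntity` shape, the `CONCEPT_MAP` lookup, the source codes and the concept IDs are all hypothetical placeholders, and the NLP extraction step is assumed to have already run. Only the table and column names follow the OMOP CDM.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedEntity:
    """One concept found in a case report by an NLP step (hypothetical shape)."""
    person_id: int
    source_code: str   # code in the source terminology (placeholder values below)
    source_text: str   # the text span the concept was extracted from


# Hypothetical lookup: source code -> (OMOP concept_id, domain_id).
# The IDs are placeholders; a real pipeline would read them from the
# OMOP vocabulary tables.
CONCEPT_MAP = {
    "SRC-HTN": (1001, "Condition"),
    "SRC-APPY": (2001, "Procedure"),
    "SRC-ASA": (3001, "Drug"),
}

# OMOP CDM clinical table and column names for each domain.
DOMAIN_TO_TABLE = {
    "Condition": ("condition_occurrence", "condition_concept_id", "condition_source_value"),
    "Procedure": ("procedure_occurrence", "procedure_concept_id", "procedure_source_value"),
    "Drug": ("drug_exposure", "drug_concept_id", "drug_source_value"),
}


def load_entities(entities):
    """Route each mapped entity to the table for its domain.

    Returns (tables, unmapped): a dict of table name -> list of row dicts,
    and the list of entities with no entry in CONCEPT_MAP.
    """
    tables = defaultdict(list)
    unmapped = []
    for ent in entities:
        mapped = CONCEPT_MAP.get(ent.source_code)
        if mapped is None:
            unmapped.append(ent)  # keep for review rather than dropping silently
            continue
        concept_id, domain = mapped
        table, concept_col, source_col = DOMAIN_TO_TABLE[domain]
        tables[table].append({
            "person_id": ent.person_id,
            concept_col: concept_id,
            source_col: ent.source_text,
        })
    return dict(tables), unmapped
```

Routing on the concept's domain rather than on the source terminology means a single lookup decides both the target concept and the target table. Dates and record type concepts are omitted here because the abstract identifies temporal extraction as future work.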
