Identifying Deviations from Usual Medical Care using a Statistical Approach

Methods that detect deviations from usual medical care may be useful for building automated clinical alerting systems that draw clinicians' attention to treatment choices warranting additional consideration. We developed a method for identifying deviations in medication administration in the intensive care unit: logistic regression models are learned from past patient data and then applied to current patient data to identify statistically unusual treatment decisions. The models predicted a total of 53 deviations for 6 medications on a set of 3,000 patient cases. A set of 12 predicted deviations and 12 non-deviations was evaluated by a group of intensive care physicians. Overall, the predicted deviations were judged to often warrant an alert and to be clinically useful, and the frequency with which such alerts would be raised is unlikely to be disruptive in a clinical setting.

Introduction

The rising deployment of electronic medical records makes it feasible to construct statistical models of usual patient care in a given clinical setting. Such models can be used to determine whether the management of a current patient case is unusual in some way; if so, an alert can be raised. While unusual management may be intended and justified, it may sometimes indicate suboptimal care that can be modified in time to help the patient. We are investigating the extent to which alerts about unusual patient management can be clinically helpful. This paper describes a laboratory-type (offline) study of alerting. In particular, it describes a study of alerts that were raised for intensive care unit (ICU) patients who were expected to receive particular medications but did not, according to statistical models constructed from past ICU patients.

Background

The characterization and identification of deviations have been studied in several domains, including fraudulent credit card transactions, network intrusions, and aberrations in medical data (1). In healthcare, rule-based expert systems are a commonly used computerized method for identifying deviations and errors. We propose that statistical methods can be a complementary approach to rule-based expert systems for identifying deviations (2). Rule-based methods excel when expected patterns of care are well established, considered important, and can be feasibly codified. Statistical methods have the potential to “fill in” many additional expected patterns of care that are not as well established, are complex to codify, or both.

Rule-based expert systems apply a knowledge base of rules to patient data. The advantage of rule-based systems is that they encode established clinical knowledge and thus are likely to be clinically useful. In addition, such rules are relatively easy to automate and can be readily applied to patient data that are available in electronic form. Rule-based systems have been developed and deployed for medication decision support (e.g., automated dosing guidelines and identification of adverse drug interactions), monitoring of treatment protocols for infectious diseases, identification of clinically important events in the management of chronic conditions such as diabetes (3), and other tasks. However, hand-crafted rule-based systems have several disadvantages. Creating rules requires input from human experts, which can be tedious and time-consuming, and rules typically cover only a limited portion of the large space of possible adverse events, particularly the more complex ones.
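As a concrete illustration, the following minimal sketch shows the flavor of a hand-coded rule of the kind described above; the interacting drug pairs and alert text are hypothetical examples, not rules from any deployed system.

```python
# A hypothetical, minimal rule-based check of the kind described above.
# The knowledge base and alert text are illustrative assumptions only.

# Hypothetical knowledge base: pairs of generic drug names that interact.
INTERACTING_PAIRS = {
    frozenset({"warfarin", "aspirin"}),          # increased bleeding risk
    frozenset({"sildenafil", "nitroglycerin"}),  # risk of severe hypotension
}

def interaction_alerts(administered_meds):
    """Return an alert for every known interacting pair found in the
    patient's current medication list."""
    meds = {m.lower() for m in administered_meds}
    return [
        "Potential interaction: " + " + ".join(sorted(pair))
        for pair in INTERACTING_PAIRS
        if pair <= meds  # both drugs of the pair are present
    ]

print(interaction_alerts(["Warfarin", "Aspirin", "Metoprolol"]))
# -> ['Potential interaction: aspirin + warfarin']
```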

We can apply statistical methods to identify anomalous patterns in patient data, such as laboratory tests or treatments that are statistically highly unusual with respect to past patients with the same or similar conditions. A deviation is also known as an outlier, an anomaly, an exception, or an aberration. The basis of such an approach is that (1) past patient records stored in electronic medical records reflect the local standard of clinical care, (2) events (e.g., treatment decisions) that deviate from such standards can often be identified, and (3) such outliers represent events that are unusual or surprising compared with previous comparable cases and may indicate patient-management errors. This approach has several advantages: it does not require expert input to build a detection system; clinically valid and relevant deviations are derived empirically from a large set of prior patient cases; the system can be periodically and automatically re-trained; and alert coverage can be broad and deep.
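The following minimal sketch illustrates the core deviation-detection idea: a model estimates the probability that a treatment is given to comparable patients, and an omission alert is raised when that probability is high but the treatment was absent. The 0.95 threshold is an illustrative assumption, not a value used in this study.

```python
# A minimal sketch of statistical omission detection. A model trained on
# past cases supplies p_administered for the current patient; an alert is
# raised when the medication was strongly expected but not given.
# The threshold value is a hypothetical illustration.

def omission_deviation(p_administered, was_administered, threshold=0.95):
    """Flag a deviation when the model strongly expects the medication
    (probability at or above threshold) yet it was not administered."""
    return p_administered >= threshold and not was_administered

# Example: the model expects the medication in 97% of comparable past
# cases, but the current patient did not receive it -> raise an alert.
print(omission_deviation(0.97, was_administered=False))  # True
print(omission_deviation(0.40, was_administered=False))  # False
```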

In this study, we develop and evaluate a statistical method for identifying deviations from expected medications for patients in the intensive care unit. In particular, we (1) developed logistic regression models of usual medication administration patterns during the first day of stay in a medical ICU, (2) applied these models to identify medication omissions that the models deemed to be statistical deviations, and (3) estimated the clinical validity and utility of these deviations based on the judgments of a panel of critical care physicians.

Methods

Data

The data we used come from the HIgh-DENsity Intensive Care (HIDENIC) dataset, which contains clinical data on patients admitted to the ICUs at the University of Pittsburgh Medical Center. For our study, we used the HIDENIC data from 12,000 sequential patient admissions that occurred between July 2000 and December 2001. We developed probabilistic models that predict which medications will be administered to an ICU patient within the first 24 hours of stay in the ICU, and we used those models to identify deviations from expected medication administration.

For predictors, we selected five variables whose values were available for all patients in the data at the time of admission to the ICU: the admitting diagnosis, the age of the patient at admission, the gender of the patient, the particular ICU to which the patient was admitted, and the Acute Physiology and Chronic Health Evaluation (APACHE) III score of the patient at admission. The admitting diagnosis was coded by the ICU physician or nurse as one of the diagnoses listed in the Joint Commission on Accreditation of Healthcare Organizations' (JCAHO's) Specifications Manual for National Hospital Quality Measures-ICU (4). The APACHE III score was generated by an outcome prediction model and has been widely used for assessing the severity of illness in acutely ill ICU patients; it is based on measurements of 17 physiological variables, age, and chronic health status. The score ranges from 0 to 299 and correlates with the patient's risk of mortality. Three of the variables are categorical (admitting diagnosis, gender, and ICU), and the remaining two (age and APACHE III score) are continuous.
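As an illustration of how such predictors might be encoded for logistic regression, the sketch below one-hot encodes the three categorical variables and passes the two continuous variables through unchanged; the column names and example values are hypothetical, not fields from the HIDENIC dataset.

```python
# A minimal sketch of encoding the five predictors described above.
# Column names and example values are hypothetical illustrations.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

CATEGORICAL = ["admit_diagnosis", "gender", "icu"]
CONTINUOUS = ["age", "apache_iii"]

encoder = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ("num", "passthrough", CONTINUOUS),
])

admissions = pd.DataFrame([
    {"admit_diagnosis": "sepsis", "gender": "F", "icu": "MICU",
     "age": 67, "apache_iii": 82},
    {"admit_diagnosis": "GI bleed", "gender": "M", "icu": "MICU",
     "age": 54, "apache_iii": 45},
])
X = encoder.fit_transform(admissions)
print(X.shape)  # one row per admission; one column per category level, plus 2
```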

While the data for the five predictor variables were originally captured in coded form in the hospital's ICU medical record system, the administered medications were entered into the medical record as free text, which resulted in variations in the medication names. We pre-processed the medication entries to correct misspellings and to expand abbreviated names. We then mapped each medication name to the standard generic name obtained from the U.S. Food and Drug Administration (FDA) Approved Drug Products with Therapeutic Equivalence Evaluations, 26th Edition (Electronic Orange Book). This process resulted in a total of 307 unique medications. For each patient admission, a medication indicator vector was created that contained the value 1 if the medication was administered during the first 24 hours of stay and the value 0 if not.
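The following sketch illustrates this pre-processing step; the name-mapping table is a tiny hypothetical stand-in for the Orange Book mapping, and the medication list is truncated for brevity.

```python
# A minimal sketch of the medication pre-processing described above:
# free-text entries are normalized to generic names via a lookup table,
# then converted into a 0/1 indicator vector. The lookup table is a
# hypothetical stand-in for the Orange Book mapping.
NAME_MAP = {
    "asa": "aspirin",
    "asprin": "aspirin",         # misspelling
    "tylenol": "acetaminophen",  # brand name -> generic
}

MEDICATIONS = ["acetaminophen", "aspirin", "heparin"]  # the study used 307

def normalize(entry):
    key = entry.strip().lower()
    return NAME_MAP.get(key, key)

def indicator_vector(free_text_entries):
    """1 if the medication was given in the first 24 hours, else 0."""
    given = {normalize(e) for e in free_text_entries}
    return [1 if med in given else 0 for med in MEDICATIONS]

print(indicator_vector(["ASA", "Tylenol"]))  # [1, 1, 0]
```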

Statistical Models

The dataset of 12,000 patient admissions was temporally split into 6,000 training cases (cases 1 to 6,000 in chronological order), 3,000 validation cases (cases 6,001 to 9,000), and 3,000 test cases (cases 9,001 to 12,000). Medications that were administered fewer than 20 times in the training and validation cases were excluded from the study, because such small sample sizes cannot support reliable model construction. After filtering out these rarely used medications, which constituted approximately 2.5% of all medication entries, 152 medications remained in the study. We applied logistic regression to the 6,000 training cases to model the relationship between the five predictor variables and the medications administered within the first 24 hours of stay. For each of the 152 medications, we constructed a distinct logistic regression model using the implementation in the machine-learning software suite Weka version 3.5.6 (University of Waikato, New Zealand) (5).
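A minimal sketch of this modeling step is shown below, with scikit-learn's logistic regression standing in for the Weka implementation used in the study; the design matrix and medication indicator matrix are random placeholders for the encoded HIDENIC data.

```python
# A minimal sketch of the temporal split and per-medication modeling.
# scikit-learn's LogisticRegression stands in for the Weka implementation;
# X and med_matrix are random placeholders for the real encoded data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(12000, 30))                    # placeholder design matrix
med_matrix = rng.integers(0, 2, size=(12000, 152))  # placeholder 0/1 indicators

# Temporal split: earliest 6,000 cases for training, next 3,000 for
# validation, last 3,000 for testing (rows are in chronological order).
train, valid, test = slice(0, 6000), slice(6000, 9000), slice(9000, 12000)

models = {}
for j in range(med_matrix.shape[1]):  # one model per medication
    # Exclude medications given fewer than 20 times in training + validation.
    if med_matrix[:9000, j].sum() < 20:
        continue
    models[j] = LogisticRegression(max_iter=1000).fit(X[train],
                                                      med_matrix[train, j])
print(f"{len(models)} medication models trained")
```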

We then used the validation cases to identify models that were deemed reliable. Each of the 152 logistic regression models was applied to the 3,000 validation cases to derive the probability that the medication defined by the model was administered, and these probabilities were used to measure the discrimination and calibration of the model. Discrimination measures how well a model differentiates patients who had the outcome of interest from those who did not. Calibration assesses how close a model's estimated probabilities are to the actual frequencies of the outcome. We used the area under the Receiver Operating Characteristic (ROC) curve (AUROC) to measure discrimination and the Hosmer-Lemeshow statistic (HLS) to measure calibration (6). A high AUROC indicates good discrimination. For the HLS, a low p-value implies that the model is poorly calibrated, whereas a large p-value suggests either that the model is well calibrated or that the data are insufficient to tell whether it is poorly calibrated.
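The sketch below illustrates how these two measures could be computed for one model's validation predictions; the Hosmer-Lemeshow function is a standard decile-based implementation and is only assumed to behave like the Stata procedure used in the study.

```python
# A minimal sketch of the validation measures described above: AUROC for
# discrimination and a decile-based Hosmer-Lemeshow p-value for
# calibration. This HLS implementation is a common textbook version and
# an assumed stand-in for the Stata "hl" procedure used in the study.
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import roc_auc_score

def hosmer_lemeshow_p(y_true, p_hat, groups=10):
    """Chi-square goodness-of-fit over deciles of predicted probability."""
    order = np.argsort(p_hat)
    y, p = np.asarray(y_true)[order], np.asarray(p_hat)[order]
    stat = 0.0
    for chunk_y, chunk_p in zip(np.array_split(y, groups),
                                np.array_split(p, groups)):
        expected = chunk_p.sum()             # expected positives in the decile
        observed = chunk_y.sum()             # observed positives in the decile
        n = len(chunk_y)
        var = expected * (1 - expected / n)  # binomial variance
        stat += (observed - expected) ** 2 / var
    return chi2.sf(stat, df=groups - 2)

# Tiny demo with synthetic, well-calibrated probabilities:
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, 3000)
y = (rng.uniform(size=3000) < p).astype(int)  # outcomes drawn from p itself
print(roc_auc_score(y, p), hosmer_lemeshow_p(y, p))
```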

The AUROC and the HLS p-values were computed using the procedures “roctab” and “hl”, respectively, in the statistical program Intercooled Stata version 8.0 (Stata Corporation, College Station, TX). We deemed reliable those medication models that had an AUROC ≥ 0.80 and an HLS p-value ≥ 0.05 on the validation cases; there were nine such models (Figure 1).
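Continuing the training and validation sketches above (which define models, X, valid, med_matrix, roc_auc_score, and hosmer_lemeshow_p), the reliability filter itself reduces to applying the two thresholds reported here:

```python
# A minimal sketch of the reliability filter, continuing the previous
# sketches: keep only models meeting both thresholds on validation data.
reliable = {}
for j, model in models.items():
    p_valid = model.predict_proba(X[valid])[:, 1]
    if (roc_auc_score(med_matrix[valid, j], p_valid) >= 0.80
            and hosmer_lemeshow_p(med_matrix[valid, j], p_valid) >= 0.05):
        reliable[j] = model
print(f"{len(reliable)} reliable medication models")
```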