Hospital Charges for Inpatients¶

MEF University - BDA 507 Introduction to Computer Programming (Python) - Term Project¶

Emre Kemerci - 311802017¶

January 2019¶

Abstract¶

This study was conducted as the term project for BDA 507 - Introduction to Computer Programming, Fall 2018 term, at MEF University. Its aim is to perform a basic exploratory data analysis, in Python, of hospital-specific charges for more than 3,000 U.S. hospitals, based on a rate per discharge for the top 100 Medicare Severity Diagnosis Related Groups (MS-DRGs). My objective is to reach a descriptive conclusion about the average payments made for these 100 diagnoses, the providers (hospitals) that reported them, and their locations by city and state.

Explanation of the Data¶

The dataset is owned by the U.S. government's Centers for Medicare & Medicaid Services (CMS) and is freely available on the CMS website. For this study, the data were downloaded from Kaggle.com.

CMS, The Centers for Medicare & Medicaid Services, is part of the Department of Health and Human Services (HHS) which aims to "protect the health of all Americans and provide essential human services, especially for those who are least able to help themselves."

This dataset contains hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments, which are paid at a rate per discharge, for the top 100 Medicare Severity Diagnosis Related Groups (MS-DRGs). It shows how much the price for the same diagnosis and the same treatment, even in the same city, can vary across providers in the US.

Hospitals determine what they will charge for items and services provided to patients, and these charges are the amounts the hospital bills for an item or service. The Total Payment amount includes the MS-DRG amount, bill total per diem, beneficiary primary payer claim payment amount, beneficiary Part A coinsurance amount, beneficiary deductible amount, beneficiary blood deductible amount and DRG outlier amount.

The following variables are included in the data:

DRG Definition: The code and description identifying the MS-DRG. MS-DRGs are a classification system that groups similar clinical conditions (diagnoses) and the procedures furnished by the hospital during the stay.
Provider Id: The CMS Certification Number (CCN) assigned to the Medicare certified hospital facility.
Provider Name: The name of the provider.
Provider Street Address: The provider’s street address.
Provider City: The city where the provider is located.
Provider State: The state where the provider is located.
Provider Zip Code: The provider’s zip code.
Provider HRR: The Hospital Referral Region (HRR) where the provider is located.
Total Discharges: The number of discharges billed by the provider for inpatient hospital services. In other words, the total discharges indicate the number of beneficiaries who were released from the inpatient hospital after receiving care.
Average Covered Charges: The provider's average charge for services covered by Medicare for all discharges in the MS-DRG. These will vary from hospital to hospital because of differences in hospital charge structures. “Average Charges” refers to what the provider bills to Medicare.
Average Total Payments: The average total payments to all providers for the MS-DRG including the MS-DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Also included in average total payments are co-payment and deductible amounts that the patient is responsible for and any additional payments by third parties for coordination of benefits. In other words, “Average Total Payments” refers to what Medicare actually pays to the provider as well as co-payment and deductible amounts that the beneficiary is responsible for and payments by third parties for coordination of benefits. The provider has an agreement with Medicare to accept Medicare’s payment, and the difference between what the provider charges and what Medicare pays is not paid by Medicare or any other entity, including the beneficiary.
Average Medicare Payments: The average amount that Medicare pays to the provider for Medicare's share of the MS-DRG. Average Medicare payment amounts include the MS-DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Medicare payments DO NOT include beneficiary co-payments and deductible amounts nor any additional payments from third parties for coordination of benefits.
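To make the three monetary columns concrete, the short sketch below takes the values of the first row of the dataset (DRG 039 at Southeast Alabama Medical Center) and derives two quantities that are not columns in the data: the part of the total payment that does not come from Medicare, and the share of the billed charge that was actually paid. The variable names are illustrative only.

```python
# Values from the first row of the dataset (DRG 039, Southeast Alabama Medical Center).
avg_covered_charges = 32963.07    # what the provider billed
avg_total_payments = 5777.24      # Medicare + beneficiary + third-party payments
avg_medicare_payments = 4763.73   # Medicare's share only

# Part of the total payment that comes from the beneficiary (co-payment,
# deductible) or from third parties, i.e. everything that is not Medicare.
non_medicare_share = round(avg_total_payments - avg_medicare_payments, 2)

# Fraction of the billed charge that was actually paid.
paid_to_charged = round(avg_total_payments / avg_covered_charges, 3)

print(non_medicare_share, paid_to_charged)
```

For this row, less than a fifth of the billed amount was paid, which illustrates why covered charges and total payments are analysed as separate variables.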

Data Explore & Preparation¶

First, load the necessary packages.

import os.path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore") #to disable warning messages

I copied the raw data to my GitHub page beforehand, so I start by downloading it from there.

# Downloading CSV files from the web
from urllib.request import urlretrieve
website = "https://github.com/EmreKemerci/EmreKemerci/blob/master/bda507/inpatientCharges.csv?raw=true"
urlretrieve(website, "inpatientCharges.csv")
df = pd.read_csv("inpatientCharges.csv")
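The download step above fetches the file on every run. A minimal sketch of a guard that reuses a local copy when one exists is shown here; `fetch_once` is a hypothetical helper name, not part of the original notebook, and the `retrieve` parameter exists only so the function can be exercised without network access.

```python
import os.path
from urllib.request import urlretrieve


def fetch_once(url, filename, retrieve=urlretrieve):
    """Download url to filename unless the file already exists.

    Returns True if a download happened, False if the local copy was reused.
    """
    if os.path.exists(filename):
        return False
    retrieve(url, filename)
    return True
```

Under these assumptions, `fetch_once(website, "inpatientCharges.csv")` could stand in for the direct `urlretrieve` call.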

df.dtypes

DRG Definition                          object
Provider Id                              int64
Provider Name                           object
Provider Street Address                 object
Provider City                           object
Provider State                          object
Provider Zip Code                        int64
Hospital Referral Region Description    object
 Total Discharges                        int64
 Average Covered Charges                object
 Average Total Payments                 object
Average Medicare Payments               object
dtype: object

df.shape

(163065, 12)

df.head(5)

df.tail(2)

The data have 12 columns and 163,065 rows (observations). At first glance, the data types of some variables need to be changed, and some column names contain spaces.

To rename the columns:

df.columns= ['DRGDef', "ProviderID",'ProviderName', "ProviderAddress","ProviderCity","ProviderState","ProviderZipCode","Region","TotalDischarges","AvrCoveredCharges","AvrTotalPayments","AvrMedicarePayments"]

df.head(1)
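The `df.dtypes` output shows that some raw column names carry a leading space (for example " Total Discharges"). As an alternative to typing the full list of names, the whitespace can be removed programmatically. The sketch below uses a toy frame with three of the raw headers; note that the names it produces (such as `AverageTotalPayments`) differ from the shorter names chosen in this study.

```python
import pandas as pd

# Toy frame whose headers mimic the raw file: two names carry a leading space.
raw = pd.DataFrame(columns=["Provider Id", " Total Discharges", " Average Total Payments"])

# Strip surrounding whitespace, then drop the inner spaces.
cleaned = raw.columns.str.strip().str.replace(" ", "", regex=False)
print(list(cleaned))
```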

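# Strip the "$" sign (anything that is not a digit or a decimal point) and convert to float.
# regex=True is stated explicitly because newer pandas versions treat the pattern as a literal string by default.
df.AvrCoveredCharges = df.AvrCoveredCharges.str.replace(r'[^\d.]', '', regex=True).astype(float)
df.AvrTotalPayments = df.AvrTotalPayments.str.replace(r'[^\d.]', '', regex=True).astype(float)
df.AvrMedicarePayments = df.AvrMedicarePayments.str.replace(r'[^\d.]', '', regex=True).astype(float)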

df = df.dropna() # drop rows with missing values, if any (the result must be assigned back)
df.shape

(163065, 12)
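The currency conversion can be checked on a few values before it is applied to a whole column. The sketch below uses three strings in the same format as the raw charge columns; the series name `charges` is illustrative.

```python
import pandas as pd

# Sample values in the same format as the raw charge columns.
charges = pd.Series(["$32963.07", "$5777.24", "$4763.73"])

# Remove everything except digits and the decimal point, then convert.
# regex=True is spelled out because the default differs between pandas versions.
as_float = charges.str.replace(r"[^\d.]", "", regex=True).astype(float)
print(as_float.tolist())
```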

df.dtypes

DRGDef                  object
ProviderID               int64
ProviderName            object
ProviderAddress         object
ProviderCity            object
ProviderState           object
ProviderZipCode          int64
Region                  object
TotalDischarges          int64
AvrCoveredCharges      float64
AvrTotalPayments       float64
AvrMedicarePayments    float64
dtype: object

df.head(5)

df.describe()

Looking at the numerical variables, each row (one diagnosis at one provider) has on average 43 discharges, with a minimum of 11, a maximum of 3,383 and a standard deviation of 51. The average covered charge is about USD 36,134, with a minimum of USD 2,459, a maximum of USD 929,119 and a standard deviation of USD 35,065.
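Two simple ratios help to read these summary statistics. The sketch below plugs in the mean, median (50% row) and standard deviation of the covered charges as reported by `df.describe()`; the interpretation in the comments is a rule of thumb, not a formal test.

```python
# Summary statistics for AvrCoveredCharges, copied from the df.describe() output.
mean, median, std = 36133.954224, 25245.82, 35065.365931

# A mean well above the median points to a right-skewed distribution.
mean_over_median = round(mean / median, 2)

# Coefficient of variation: spread relative to the mean.
coeff_of_variation = round(std / mean, 2)

print(mean_over_median, coeff_of_variation)
```

A mean about 43% above the median, together with a standard deviation almost as large as the mean, suggests that a small number of very high charges pull the average up.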

categorical = df.dtypes[df.dtypes == "object"].index
df[categorical].describe()

Among the categorical variables, there are 100 unique diagnoses and 3,201 unique provider names, located in 1,977 cities, 306 regions and 51 states. The US has 50 states, so it is worth checking the state list for anomalies.

df.ProviderState.unique()

array(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA',
       'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA',
       'MI', 'MN', 'MS', 'TX', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM',
       'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN',
       'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)

The difference comes from "DC": the District of Columbia, which contains the capital city, is counted as a state in this dataset.
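The same check can be done without reading the list by eye, using a set difference. The sketch below hard-codes the 51 codes printed above and a hand-typed set of the 50 state abbreviations; both variable names are illustrative.

```python
# State codes as returned by df.ProviderState.unique() in this dataset.
dataset_codes = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA",
    "MI", "MN", "MS", "TX", "MO", "MT", "NE", "NV", "NH", "NJ", "NM",
    "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN",
    "UT", "VT", "VA", "WA", "WV", "WI", "WY",
}

# The 50 U.S. state abbreviations (typed in by hand for this sketch).
us_states = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
}

extra = dataset_codes - us_states      # codes in the data that are not states
missing = us_states - dataset_codes    # states that never appear in the data
print(extra, missing)
```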

ANALYSIS¶

Density of States¶

plt.figure(figsize=(16,10), dpi= 80)
sns.countplot(x="ProviderState", data=df,order =df['ProviderState'].value_counts().index)
plt.xlabel('States')
plt.ylabel('Frequency')
plt.title('# of observations per State')
plt.xticks(rotation=90)

CA (California) has the most observations, where each observation is one combination of diagnosis and provider, while AK (Alaska) has the fewest.

CA, TX, FL and NY stand out from the other states in number of observations.

plt.figure(figsize=(16,10), dpi= 80)
# fill=True replaces the deprecated shade=True argument of kdeplot
sns.kdeplot(df.loc[df['ProviderState'] == "CA", "AvrTotalPayments"], fill=True, color="g", label="CA", alpha=.7)
sns.kdeplot(df.loc[df['ProviderState'] == "TX", "AvrTotalPayments"], fill=True, color="red", label="TX", alpha=.7)
sns.kdeplot(df.loc[df['ProviderState'] == "FL", "AvrTotalPayments"], fill=True, color="orange", label="FL", alpha=.7)
sns.kdeplot(df.loc[df['ProviderState'] == "NY", "AvrTotalPayments"], fill=True, color="grey", label="NY", alpha=.7)
plt.legend() # show the state labels

plt.title('Density Plot', fontsize=12)
plt.xlabel('Average Total Payment Amount')
plt.ylabel('Density')

The density of average total payments is similar across these four states, which have the most observations.

Average Total Payments per Diagnosis and Provider Location¶

I started with analysing the average total payments for each diagnosis.
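One caveat applies to the per-diagnosis and per-state averages in this section: every row is already a provider-level average, so a plain mean gives a provider with 11 discharges the same weight as one with several thousand. The sketch below contrasts the plain mean with a discharge-weighted mean on two made-up rows; the frame `toy` and the helper column `PaymentTimesDischarges` are assumptions for illustration and do not appear in the study.

```python
import pandas as pd

# Made-up rows: two providers reporting the same DRG.
toy = pd.DataFrame({
    "DRGDef": ["A", "A"],
    "TotalDischarges": [10, 90],
    "AvrTotalPayments": [10000.0, 5000.0],
})

# Unweighted mean: every provider counts equally.
unweighted = toy.groupby("DRGDef")["AvrTotalPayments"].mean()

# Discharge-weighted mean: sum(payment * discharges) / sum(discharges).
toy["PaymentTimesDischarges"] = toy["AvrTotalPayments"] * toy["TotalDischarges"]
sums = toy.groupby("DRGDef")[["PaymentTimesDischarges", "TotalDischarges"]].sum()
weighted = sums["PaymentTimesDischarges"] / sums["TotalDischarges"]

print(unweighted["A"], weighted["A"])
```

Which of the two is appropriate depends on the question: the plain mean describes a typical provider, while the weighted mean describes a typical discharge.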

df1 = df.groupby('DRGDef',as_index=False)[['AvrTotalPayments']].mean()

plt.figure(figsize=(16,10), dpi= 80)
sns.barplot(x='DRGDef', y='AvrTotalPayments', 
            data=df1.sort_values('AvrTotalPayments', ascending=False))

plt.title('Mean Average Total Payments per Diagnosis', fontsize=12)
plt.ylabel('Mean of Average Total Payments', fontsize=8)
plt.xticks(fontsize=8, rotation=90)

Next, I draw the bar graph of average total payments for each state.

df2 = df.groupby('ProviderState',as_index=False)[['AvrTotalPayments']].mean()

plt.figure(figsize=(16,10), dpi= 80)
sns.barplot(x='ProviderState', y='AvrTotalPayments', 
            data=df2.sort_values('AvrTotalPayments', ascending=False))

plt.title('Mean Average Total Payments per State', fontsize=12)
plt.ylabel('Mean of Average Total Payments', fontsize=8)
plt.xticks(fontsize=12, rotation=90)

I also want to see how each diagnosis is priced in each state.

df = df.assign(DRGCode=df.DRGDef.str[:3])

df6 = df.loc[:, ['DRGCode', 'ProviderState', 'AvrTotalPayments' ]]

df6= df6.groupby(['ProviderState', 'DRGCode' ],as_index=False)[['AvrTotalPayments']].mean()

df6 = df6.pivot_table('AvrTotalPayments', ['DRGCode'], 'ProviderState') #convert rows to columns

df6 = df6.fillna(0) # convert NaN to 0

df6_norm = (df6 - df6.min()) / (df6.max() - df6.min()) #normalization

plt.figure(figsize=(20,15), dpi= 80)
sns.heatmap(df6_norm, square=False, cbar=True, cmap="coolwarm")
plt.title("Average Total Payment (Normalized) per ProviderState and DRG Code", fontsize=12)

The five most expensive diagnoses (DRG definitions) are:

SEPTICEMIA OR SEVERE SEPSIS (blood poisoning, especially that caused by bacteria or their toxins, or a serious infection that causes the immune system to attack the body)
INFECTIOUS & PARASITIC DISEASES
RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT 96+ HOURS
MAJOR SMALL & LARGE BOWEL PROCEDURES
SPINAL FUSION EXCEPT CERVICAL W/O MCC

while the cheapest ones are:

CHEST PAIN
CARDIAC ARRHYTHMIA & CONDUCTION DISORDER

and the most expensive states are:

AK Alaska
DC Washington DC
HI Hawaii
CA California
MD Maryland

while the cheapest states are:

AL Alabama
WV West Virginia

Although Alaska has the fewest observations, it is the most expensive state.
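Rankings like the ones above can be read off the grouped means directly instead of from the bar charts. The sketch below shows the idea with `nlargest` and `nsmallest` on a made-up frame; the payment values are invented and only the column names follow this study.

```python
import pandas as pd

# Made-up payments for three states (values are not from the dataset).
toy = pd.DataFrame({
    "ProviderState": ["AK", "AK", "AL", "AL", "CA", "CA"],
    "AvrTotalPayments": [15000.0, 13000.0, 7000.0, 8000.0, 12000.0, 12500.0],
})

# Mean payment per state, then the top and bottom of the ranking.
state_means = toy.groupby("ProviderState")["AvrTotalPayments"].mean()
most_expensive = state_means.nlargest(2).index.tolist()
cheapest = state_means.nsmallest(1).index.tolist()

print(most_expensive, cheapest)
```

Applied to the full data, the same two calls on `df2` (or `df1` for diagnoses) would give the ranked lists without visual inspection.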

Scanning the heatmap from left to right (colours run from blue to red as payments rise), the DRGs do not appear to deviate significantly across states.
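How the heatmap reads depends on the normalization step. The heatmap code fills missing state/DRG combinations with 0 before min-max scaling each state column, so any state with a missing DRG gets 0 as its column minimum. The sketch below compares that with scaling while the gaps are left as NaN, on a made-up pivot table; `min_max` is an illustrative helper, not a function from the study.

```python
import pandas as pd

# Made-up pivot: rows are DRG codes, columns are states; NaN means the
# DRG was not reported in that state.
pivot = pd.DataFrame(
    {"AK": [20000.0, float("nan"), 8000.0], "AL": [12000.0, 6000.0, 4000.0]},
    index=["870", "313", "310"],
)


def min_max(frame):
    # Column-wise min-max scaling; pandas skips NaN when taking min and max.
    return (frame - frame.min()) / (frame.max() - frame.min())


filled_first = min_max(pivot.fillna(0))  # missing cell becomes the column minimum
nan_kept = min_max(pivot)                # missing cell stays NaN

print(filled_first.loc["310", "AK"], nan_kept.loc["310", "AK"])
```

In the toy data, the cheapest reported DRG in "AK" scores 0.4 when gaps are filled with 0 but 0.0 when they are kept as NaN, so the two choices colour the same cell differently. As far as I know, `sns.heatmap` leaves NaN cells blank, which would make the gaps visible rather than showing them as the cheapest value.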

The analysis continues by breaking states down by city for the most expensive and the cheapest state, as well as for the most expensive diagnosis.

df3 = df[(df.ProviderState == 'AK')]
       # | (df.ProviderState == 'DC') | (df.ProviderState == 'HI') | (df.ProviderState == 'CA') | (df.ProviderState == 'MD')

df3 = df3.groupby(['ProviderState','ProviderCity'],as_index=False)[['AvrTotalPayments']].mean()

plt.figure(figsize=(16,10), dpi= 80)
sns.barplot(x='ProviderCity', y='AvrTotalPayments', hue="ProviderState",
            data=df3.sort_values('AvrTotalPayments', ascending=False),color="navy", saturation=90)
plt.title('Mean Average Total Payments per City of Alaska', fontsize=12)
plt.ylabel('Mean of Average Total Payments', fontsize=8)
plt.xticks(fontsize=12, rotation=90)

(array([0, 1, 2, 3, 4, 5, 6]), <a list of 7 Text xticklabel objects>)

df4 = df[(df.ProviderState == 'AL')]
       # | (df.ProviderState == 'DC') | (df.ProviderState == 'HI') | (df.ProviderState == 'CA') | (df.ProviderState == 'MD')

df4 = df4.groupby(['ProviderState','ProviderCity'],as_index=False)[['AvrTotalPayments']].mean()

plt.figure(figsize=(16,10), dpi= 80)
sns.barplot(x='ProviderCity', y='AvrTotalPayments', hue="ProviderState",
            data=df4.sort_values('AvrTotalPayments', ascending=False), color="navy", saturation=90)
plt.title('Mean Average Total Payments per City of Alabama', fontsize=12)
plt.ylabel('Mean of Average Total Payments', fontsize=8)
plt.xticks(fontsize=12, rotation=90)

Alaska, the most expensive state on average, has 7 cities in the data, and the average total payment is highest in Fairbanks. In Alabama, the cheapest state on average, Phenix City is an outlier with an average total payment over USD 11k, as expensive as mid-range cities in Alaska.
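The two state filters above are written one state at a time. When several states are to be compared in one step, `isin` keeps the filter short. The sketch below runs it on a made-up frame; the payment values are invented and `states_of_interest` is an illustrative name.

```python
import pandas as pd

# Made-up rows (payment values are not from the dataset).
toy = pd.DataFrame({
    "ProviderState": ["AK", "AK", "AL", "CA"],
    "ProviderCity": ["ANCHORAGE", "FAIRBANKS", "DOTHAN", "CHICO"],
    "AvrTotalPayments": [14000.0, 16000.0, 7000.0, 12000.0],
})

# Keep only the states of interest, then average per city within each state.
states_of_interest = ["AK", "AL"]
subset = toy[toy["ProviderState"].isin(states_of_interest)]
city_means = subset.groupby(
    ["ProviderState", "ProviderCity"], as_index=False
)["AvrTotalPayments"].mean()

print(city_means)
```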

df5 = df[df['DRGDef'].str.startswith("870")] # keep only rows whose DRG code is 870

df5 = df5.groupby(['ProviderCity','ProviderState'],as_index=False)['AvrTotalPayments'].mean()

plt.figure(figsize=(16,100), dpi= 80)
sns.barplot(y='ProviderCity', x='AvrTotalPayments',
            data=df5.sort_values('AvrTotalPayments', ascending=False))
plt.title('Mean Average Total Payments for DRG def "870-Septicemia or Severe Sepsis" per City', fontsize=12)
plt.ylabel('Provider City', fontsize=8)
plt.xticks(fontsize=12, rotation=90)

Taking the Septicemia or Severe Sepsis diagnosis and plotting all US cities, Valhalla is the most expensive city, almost 6 times as expensive as Canonsburg.
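A comparison such as "almost 6 times as expensive" can be computed from the city means rather than estimated from the chart. The sketch below shows the calculation on a made-up series; the city labels and values are placeholders, not figures from the dataset.

```python
import pandas as pd

# Made-up city means for one diagnosis (placeholders, not dataset values).
city_means = pd.Series(
    {"CITY_A": 60000.0, "CITY_B": 25000.0, "CITY_C": 10000.0},
    name="AvrTotalPayments",
)

top_city = city_means.idxmax()       # label of the most expensive city
bottom_city = city_means.idxmin()    # label of the cheapest city
ratio = city_means.max() / city_means.min()

print(top_city, bottom_city, ratio)
```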

Conclusion¶

This study analysed the average total payments for the top 100 Medicare Severity Diagnosis Related Groups, reported by 3,201 providers in 1,977 cities across 306 regions and 51 states (including DC) of the US.

The average covered charge is about USD 36,134, with a minimum of USD 2,459, a maximum of USD 929,119 and a standard deviation of USD 35,065.
California has the most observations (combinations of diagnosis and provider), while Alaska (AK) has the fewest. California, Texas, Florida and New York stand out from the other states with their high number of observations. This means that either their populations, and hence their numbers of diagnoses, are large, or they have many providers (hospitals), or both. Looking at the density of average total payments for these four states, the distributions are almost identical, concentrated at around USD 0 - 20K.
Regarding the average total payments per diagnosis and provider location:
- Four diagnoses stand apart from the others, with average total payments over USD 30K. These are "Septicemia or Severe Sepsis" (a kind of blood poisoning), "Infectious & Parasitic Diseases", "Respiratory System Diagnosis with Ventilator Support of over 96 Hours" and "Major Small & Large Bowel Procedures".
- By average total payments, the most expensive states are, in order, Alaska, Washington DC, Hawaii, California and Maryland. Alabama and West Virginia are the cheapest.
- Alaska has the fewest observations and is the most expensive state.
- Average total payments for each diagnosis do not deviate significantly across state averages.
Breaking the most expensive state (Alaska) and the cheapest state (Alabama) down by city:
- Alaska has 7 cities in the data, and the average total payment is highest in Fairbanks. In Alabama, Phenix City is well ahead of the rest of the state with an average total payment over USD 11k, as expensive as mid-range cities in Alaska.
- For the Septicemia or Severe Sepsis diagnosis across all cities, Valhalla is the most expensive city, almost 6 times as expensive as Canonsburg, the cheapest city.

References¶

https://www.kaggle.com/speedoheck/inpatient-hospital-charges/home

https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/Inpatient_Outpatient_FAQ.pdf

https://data.cms.gov/Medicare-Inpatient/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3

https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/

https://www.kaggle.com/mlesna/data-analysis-and-visualizations

https://seaborn.pydata.org/tutorial/color_palettes.html

Output of df.head(5) on the raw data:
	DRG Definition	Provider Id	Provider Name	Provider Street Address	Provider City	Provider State	Provider Zip Code	Hospital Referral Region Description	Total Discharges	Average Covered Charges	Average Total Payments	Average Medicare Payments
0	039 - EXTRACRANIAL PROCEDURES W/O CC/MCC	10001	SOUTHEAST ALABAMA MEDICAL CENTER	1108 ROSS CLARK CIRCLE	DOTHAN	AL	36301	AL - Dothan	91	$32963.07	$5777.24	$4763.73
1	039 - EXTRACRANIAL PROCEDURES W/O CC/MCC	10005	MARSHALL MEDICAL CENTER SOUTH	2505 U S HIGHWAY 431 NORTH	BOAZ	AL	35957	AL - Birmingham	14	$15131.85	$5787.57	$4976.71
2	039 - EXTRACRANIAL PROCEDURES W/O CC/MCC	10006	ELIZA COFFEE MEMORIAL HOSPITAL	205 MARENGO STREET	FLORENCE	AL	35631	AL - Birmingham	24	$37560.37	$5434.95	$4453.79
3	039 - EXTRACRANIAL PROCEDURES W/O CC/MCC	10011	ST VINCENT'S EAST	50 MEDICAL PARK EAST DRIVE	BIRMINGHAM	AL	35235	AL - Birmingham	25	$13998.28	$5417.56	$4129.16
4	039 - EXTRACRANIAL PROCEDURES W/O CC/MCC	10016	SHELBY BAPTIST MEDICAL CENTER	1000 FIRST STREET NORTH	ALABASTER	AL	35007	AL - Birmingham	18	$31633.27	$5658.33	$4851.44

Output of df.tail(2) on the raw data:
	DRG Definition	Provider Id	Provider Name	Provider Street Address	Provider City	Provider State	Provider Zip Code	Hospital Referral Region Description	Total Discharges	Average Covered Charges	Average Total Payments	Average Medicare Payments
163063	948 - SIGNS & SYMPTOMS W/O MCC	670060	TEXAS REGIONAL MEDICAL CENTER AT SUNNYVALE	231 SOUTH COLLINS ROAD	SUNNYVALE	TX	75182	TX - Dallas	11	$28873.09	$7663.09	$6848.54
163064	948 - SIGNS & SYMPTOMS W/O MCC	670068	TEXAS HEALTH PRESBYTERIAN HOSPITAL FLOWER MOUND	4400 LONG PRAIRIE ROAD	FLOWER MOUND	TX	75028	TX - Dallas	12	$15042.00	$3539.75	$2887.41

Output of df.head(5) after renaming the columns and converting the charge columns to float:
	DRGDef	ProviderID	ProviderName	ProviderAddress	ProviderCity	ProviderState	ProviderZipCode	Region	TotalDischarges	AvrCoveredCharges	AvrTotalPayments	AvrMedicarePayments
0	039 - EXTRACRANIAL PROCEDURES W/O CC/MCC	10001	SOUTHEAST ALABAMA MEDICAL CENTER	1108 ROSS CLARK CIRCLE	DOTHAN	AL	36301	AL - Dothan	91	32963.07	5777.24	4763.73
1	039 - EXTRACRANIAL PROCEDURES W/O CC/MCC	10005	MARSHALL MEDICAL CENTER SOUTH	2505 U S HIGHWAY 431 NORTH	BOAZ	AL	35957	AL - Birmingham	14	15131.85	5787.57	4976.71
2	039 - EXTRACRANIAL PROCEDURES W/O CC/MCC	10006	ELIZA COFFEE MEMORIAL HOSPITAL	205 MARENGO STREET	FLORENCE	AL	35631	AL - Birmingham	24	37560.37	5434.95	4453.79
3	039 - EXTRACRANIAL PROCEDURES W/O CC/MCC	10011	ST VINCENT'S EAST	50 MEDICAL PARK EAST DRIVE	BIRMINGHAM	AL	35235	AL - Birmingham	25	13998.28	5417.56	4129.16
4	039 - EXTRACRANIAL PROCEDURES W/O CC/MCC	10016	SHELBY BAPTIST MEDICAL CENTER	1000 FIRST STREET NORTH	ALABASTER	AL	35007	AL - Birmingham	18	31633.27	5658.33	4851.44

Output of df.describe():
	ProviderID	ProviderZipCode	TotalDischarges	AvrCoveredCharges	AvrTotalPayments	AvrMedicarePayments
count	163065.000000	163065.000000	163065.000000	163065.000000	163065.000000	163065.000000
mean	255569.865428	47938.121908	42.776304	36133.954224	9707.473804	8494.490964
std	151563.671767	27854.323080	51.104042	35065.365931	7664.642598	7309.467261
min	10001.000000	1040.000000	11.000000	2459.400000	2673.000000	1148.900000
25%	110092.000000	27261.000000	17.000000	15947.160000	5234.500000	4192.350000
50%	250007.000000	44309.000000	27.000000	25245.820000	7214.100000	6158.460000
75%	380075.000000	72901.000000	49.000000	43232.590000	11286.400000	10056.880000
max	670077.000000	99835.000000	3383.000000	929118.900000	156158.180000	154620.810000

Output of df[categorical].describe():
	DRGDef	ProviderName	ProviderAddress	ProviderCity	ProviderState	Region
count	163065	163065	163065	163065	163065	163065
unique	100	3201	3326	1977	51	306
top	194 - SIMPLE PNEUMONIA & PLEURISY W CC	GOOD SAMARITAN HOSPITAL	100 MEDICAL CENTER DRIVE	CHICAGO	CA	CA - Los Angeles
freq	3023	633	183	1505	13064	3653