This study is conducted as term project of BDA 507 - Introduction to Computer Programming lecture of Fall 2018 Term in MEF University. The aim of this study is to perform basic explanatory data analysis of hospital-specific charges for the more than 3,000 U.S. hospitals based on a rate per discharge using the top 100 Medicare Severity Diagnosis Related Group (MS-DRG) using Python language. My object is to reach a descriptive conclusion for average payments made for these 100 diagnosis, relative provider(hospital) reported the expense and their location based on cities and states.
The dataset is owned by the US government Centers for Medicare & Medicaid Services (CMS) and freely available on CMS web site. In this study, the data were exported from Kaggle.com
CMS, The Centers for Medicare & Medicaid Services, is part of the Department of Health and Human Services (HHS) which aims to "protect the health of all Americans and provide essential human services, especially for those who are least able to help themselves."
This dataset contains hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments paid under Medicare based on a rate per discharge using the top 100 Medicare Severity Diagnosis Related Group (MS-DRG). It shows how price for the same diagnosis and the same treatment and in the same city can vary differently across different providers in US.
Hospitals determine what they will charge for items and services provided to patients and these charges are the amount the hospital bills for an item or service. The Total Payment amount includes the MS-DRG amount, bill total per diem, beneficiary primary payer claim payment amount, beneficiary Part A coinsurance amount, beneficiary deductible amount, beneficiary blood deducible amount and DRG outlier amount.
The following variables are included in the data:
DRG Definition: The code and description identifying the MS-DRG. MS-DRGs are a classification system that groups similar clinical conditions (diagnoses) and the procedures furnished by the hospital during the stay.
Provider Id: The CMS Certification Number (CCN) assigned to the Medicare certified hospital facility.
Provider Name: The name of the provider.
Provider Street Address: The provider’s street address.
Provider City: The city where the provider is located.
Provider State: The state where the provider is located.
Provider Zip Code: The provider’s zip code.
Provider HRR: The Hospital Referral Region (HRR) where the provider is located.
Total Discharges: The number of discharges billed by the provider for inpatient hospital services. In other words, the total discharges indicate the number of beneficiaries who were released from the inpatient hospital after receiving care.
Average Covered Charges: The provider's average charge for services covered by Medicare for all discharges in the MS-DRG. These will vary from hospital to hospital because of differences in hospital charge structures. “Average Charges” refers to what the provider bills to Medicare.
Average Total Payments: The average total payments to all providers for the MS-DRG including the MS-DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Also included in average total payments are co-payment and deductible amounts that the patient is responsible for and any additional payments by third parties for coordination of benefits. In other words; “Average Total Payments” refers to what Medicare actually pays to the provider as well as co-payment and deductible amounts that the beneficiary is responsible for and payments by third parties for coordination of benefits. The provider has an agreement with Medicare to accept Medicare’s payment and the difference between what the provider charges and Medicare pays is not paid by Medicare or any other entity, including the beneficiary.
Average Medicare Payments: The average amount that Medicare pays to the provider for Medicare's share of the MS-DRG. Average Medicare payment amounts include the MS-DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Medicare payments DO NOT include beneficiary co-payments and deductible amounts nor any additional payments from third parties for coordination of benefits.
First, start with loading the necessary packages
import os.path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore") #to disable warning messages
I copied the raw data to my Github page beforehand and so I started with exporting it from there.
# Downloading CSV files from the web
from urllib.request import urlretrieve
website = "https://github.com/EmreKemerci/EmreKemerci/blob/master/bda507/inpatientCharges.csv?raw=true"
urlretrieve(website, "inpatientCharges.csv")
df = pd.read_csv("inpatientCharges.csv")
df.dtypes
df.shape
df.head(5)
df.tail(2)
The data have 12 columns and 163065 rows/observations. At the first look it looks like some variables' data type are need to be changed, some column names have spaces.
In order to re-state column names:
df.columns= ['DRGDef', "ProviderID",'ProviderName', "ProviderAddress","ProviderCity","ProviderState","ProviderZipCode","Region","TotalDischarges","AvrCoveredCharges","AvrTotalPayments","AvrMedicarePayments"]
df.head(1)
df.AvrCoveredCharges = df.AvrCoveredCharges.str.replace('[^\d\.]', '').astype(float)
df.AvrTotalPayments = df.AvrTotalPayments.str.replace('[^\d\.]', '').astype(float)
df.AvrMedicarePayments = df.AvrMedicarePayments.str.replace('[^\d\.]', '').astype(float)
df.dropna() # remove NA columns if any.
df.shape
df.dtypes
df.head(5)
df.describe()
When we check the numerical variables, we see that per diagnosis and per provider, avergare discharges are 43 while min is 11 and max is 3383 with standart deviation of 51. Besides average charge of provider are USD 36,133 while min is USD 2459 and max is USD 929118 with standart deviation of 35065.
categorical = df.dtypes[df.dtypes == "object"].index
df[categorical].describe()
At the categorical variables, we see that we have 100 unique diagnosises and 3201 providers which are located in 1977 cities, 306 region and in 51 states. US have 50 states, so it is worth to check is there any anomaly in state list.
df.ProviderState.unique()
The difference is coming from "DC". Since its capital city it counted as state in this dataset.
plt.figure(figsize=(16,10), dpi= 80)
sns.countplot(x="ProviderState", data=df,order =df['ProviderState'].value_counts().index)
plt.xlabel('States')
plt.ylabel('Frequency')
plt.title('# of observations per State')
plt.xticks(rotation=90)
CA-California is the top state showing the most observations - combination of diagnosis and provider while AK-Alaska is the lowest one.
CA, TX, FL and NY is the group distinguishing form others in terms of number of observations.
plt.figure(figsize=(16,10), dpi= 80)
sns.kdeplot(df.loc[df['ProviderState'] == "CA", "AvrTotalPayments"], shade=True, color="g", label="CA", alpha=.7)
sns.kdeplot(df.loc[df['ProviderState'] == "TX", "AvrTotalPayments"], shade=True, color="red", label="TX", alpha=.7)
sns.kdeplot(df.loc[df['ProviderState'] == "FL", "AvrTotalPayments"], shade=True, color="orange", label="FL", alpha=.7)
sns.kdeplot(df.loc[df['ProviderState'] == "NY", "AvrTotalPayments"], shade=True, color="grey", label="NY", alpha=.7)
plt.title('Density Plot', fontsize=12)
plt.xlabel('Average Total Payment Amount')
plt.ylabel('Density')
It is seen that average total payments density is similar for these 4 countries showing highest frequency.
I started with analysing the average total payments for each diagnosis.
df1 = df.groupby('DRGDef',as_index=False)[['AvrTotalPayments']].mean()
plt.figure(figsize=(16,10), dpi= 80)
sns.barplot(x='DRGDef', y='AvrTotalPayments',
data=df1.sort_values('AvrTotalPayments', ascending=False))
plt.title('Mean Average Total Payments per Diagnosis', fontsize=12)
plt.ylabel('Mean of Average Total Payments', fontsize=8)
plt.xticks(fontsize=8, rotation=90)
And, also draw the bar graph of Average Total Payments for each state.
df2 = df.groupby('ProviderState',as_index=False)[['AvrTotalPayments']].mean()
plt.figure(figsize=(16,10), dpi= 80)
sns.barplot(x='ProviderState', y='AvrTotalPayments',
data=df2.sort_values('AvrTotalPayments', ascending=False))
plt.title('Mean Average Total Payments per State', fontsize=12)
plt.ylabel('Mean of Average Total Payments', fontsize=8)
plt.xticks(fontsize=12, rotation=90)
I also want to figure how each diagnosis priced in each state.
df = df.assign(DRGCode=df.DRGDef.str[:3])
df6 = df.loc[:, ['DRGCode', 'ProviderState', 'AvrTotalPayments' ]]
df6= df6.groupby(['ProviderState', 'DRGCode' ],as_index=False)[['AvrTotalPayments']].mean()
df6 = df6.pivot_table('AvrTotalPayments', ['DRGCode'], 'ProviderState') #convert rows to columns
df6 = df6.fillna(0) # convert NaN to 0
df6_norm = (df6 - df6.min()) / (df6.max() - df6.min()) #normalization
plt.figure(figsize=(20,15), dpi= 80)
sns.heatmap(df6_norm, square=False, cbar=True, cmap="coolwarm")
plt.title("Average Total Payment (Normalized) per ProviderState and DRG Code", fontsize=12)
The top 5 expensive DRG Definition- diagnosis are;
while the chepaest ones are;
and the most expensive states are;
while the cheapest states are;
While Alaska has the lowest frequency of observations, it is the most expensive state.
When we scan the heatmap from left to right, it looks like DRGs (from blue to red, gets expensive) are not significantly deviates across states.
Continue to analysis with investigating states by citites for the most expensive and cheap states as well as the most expensive diagnosis.
df3 = df[(df.ProviderState == 'AK')]
# | (df.ProviderState == 'DC') | (df.ProviderState == 'HI') | (df.ProviderState == 'CA') | (df.ProviderState == 'MD')
df3 = df3.groupby(['ProviderState','ProviderCity'],as_index=False)[['AvrTotalPayments']].mean()
plt.figure(figsize=(16,10), dpi= 80)
sns.barplot(x='ProviderCity', y='AvrTotalPayments', hue="ProviderState",
data=df3.sort_values('AvrTotalPayments', ascending=False),color="navy", saturation=90)
plt.title('Mean Average Total Payments per City of Alaska', fontsize=12)
plt.ylabel('Mean of Average Total Payments', fontsize=8)
plt.xticks(fontsize=12, rotation=90)
df4 = df[(df.ProviderState == 'AL')]
# | (df.ProviderState == 'DC') | (df.ProviderState == 'HI') | (df.ProviderState == 'CA') | (df.ProviderState == 'MD')
df4 = df4.groupby(['ProviderState','ProviderCity'],as_index=False)[['AvrTotalPayments']].mean()
plt.figure(figsize=(16,10), dpi= 80)
sns.barplot(x='ProviderCity', y='AvrTotalPayments', hue="ProviderState",
data=df4.sort_values('AvrTotalPayments', ascending=False), color="navy", saturation=90)
plt.title('Mean Average Total Payments per City of Alabama', fontsize=12)
plt.ylabel('Mean of Average Total Payments', fontsize=8)
plt.xticks(fontsize=12, rotation=90)
Alaska, the most expensive state on average, have 7 cities. And average total payment in "Fairbank" is the highest one. In Alabama, being the cheapest on average, Phenix City is outlier in state with average total payment over 11k, expensive as much as moderate cities in Alaska.
df5 = df[df['DRGDef'].str.contains("870")]
df5 = df5.groupby(['ProviderCity','ProviderState'],as_index=False)['AvrTotalPayments'].mean()
plt.figure(figsize=(16,100), dpi= 80)
sns.barplot(y='ProviderCity', x='AvrTotalPayments',
data=df5.sort_values('AvrTotalPayments', ascending=False))
plt.title('Mean Average Total Payments for DRG def "870-Septicemia or Severe Sepsis" per State', fontsize=12)
plt.ylabel('Provider City', fontsize=8)
plt.xticks(fontsize=12, rotation=90)
Taking the Septicemia or Severe Sepsis diagnosis on hand, in case we plot the cities in US, it is seen that Valhalla is the most expensive city being almost 6 times expensive than Canonsburg.
As a result of analysis of average total payment for top 100 Medicare Severity Diagnosis Related Group reported by 3201 provider in 1,977 cities of US where belongs to 306 region and 51 States in US.
https://www.kaggle.com/speedoheck/inpatient-hospital-charges/home
https://data.cms.gov/Medicare-Inpatient/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3
https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/
https://www.kaggle.com/mlesna/data-analysis-and-visualizations