Medicaid Spending by Drug 2018 to 2022
The following statement is from the Centers for Medicare & Medicaid Services (CMS), which published this data:
“The Medicaid by Drug dataset presents information on spending for covered outpatient drugs prescribed to beneficiaries enrolled in Medicaid by physicians and other healthcare professionals.
The dataset focuses on average spending per dosage unit and change in average spending per dosage unit over time. Units refer to the drug unit in the lowest dispensable amount. It also includes spending information for manufacturer(s) of the drugs as well as consumer-friendly information of drug uses and clinical indications.
Drug spending metrics for Medicaid represent the total amount reimbursed by both Medicaid and non-Medicaid entities to pharmacies for the drug. Medicaid drug spending contains both the Federal and State reimbursement and is inclusive of any applicable dispensing fees. In addition, this total is not reduced or affected by Medicaid rebates paid to the states.”
https://drive.google.com/drive/folders/1Y9TCKXrM5Ejq30c_S12gf5abQZnAzCS_?usp=sharing
(Intercept) Tot_Clms_2021
0.027944457 0.390304766
Tot_Spndng_2022 Avg_Spnd_Per_Clm_2022
0.968748164 -1.023689024
Tot_Dsg_Unts_2022 Avg_Spnd_Per_Dsg_Unt_Wghtd_2022
0.021380416 0.086762779
Tot_Dsg_Unts_2021 Avg_Spnd_Per_Clm_2018
-0.018259898 0.007006525
Tot_Clms_2019 Tot_Spndng_2021
-0.075123286 -0.365476981
Avg_Spnd_Per_Clm_2021 Avg_Spnd_Per_Dsg_Unt_Wghtd_2021
0.391885522 -0.044258026
Chg_Avg_Spnd_Per_Dsg_Unt_21_22 Tot_Clms_2020
-0.026432371 0.600234868
Tot_Spndng_2020 Avg_Spnd_Per_Clm_2020
-0.576232149 0.618370253
CAGR_Avg_Spnd_Per_Dsg_Unt_18_22 Avg_Spnd_Per_Dsg_Unt_Wghtd_2020
-0.109747671 -0.028563344
Tot_Dsg_Unts_2020 Tot_Dsg_Unts_2018
-0.019830849 0.030375072
Tot_Mftr Tot_Dsg_Unts_2019
0.005607305 -0.010183740
Avg_Spnd_Per_Dsg_Unt_Wghtd_2019 Tot_Spndng_2019
-0.008038990 0.081092617
Avg_Spnd_Per_Clm_2019 Tot_Spndng_2018
-0.078954261 -0.029368270
The multiple linear regression is fitted on log-transformed variables to reduce non-normality and non-constant variance in the residuals. To identify the best predictors of claim count in 2022, stepwise regression is used. This process starts from the null model and adds or removes variables until it reaches the combination that best fits the data. The selected model includes total claims, total spending, total dosage units, and average spending per claim and per dosage unit across 2018 to 2022, along with the change and CAGR in average spending per dosage unit and the number of manufacturers.
Call:
lm(formula = Tot_Clms_2022 ~ Tot_Clms_2021 + Tot_Spndng_2022 +
Avg_Spnd_Per_Clm_2022 + Tot_Dsg_Unts_2022 + Avg_Spnd_Per_Dsg_Unt_Wghtd_2022 +
Tot_Dsg_Unts_2021 + Avg_Spnd_Per_Clm_2018 + Tot_Clms_2019 +
Tot_Spndng_2021 + Avg_Spnd_Per_Clm_2021 + Avg_Spnd_Per_Dsg_Unt_Wghtd_2021 +
Chg_Avg_Spnd_Per_Dsg_Unt_21_22 + Tot_Clms_2020 + Tot_Spndng_2020 +
Avg_Spnd_Per_Clm_2020 + CAGR_Avg_Spnd_Per_Dsg_Unt_18_22 +
Avg_Spnd_Per_Dsg_Unt_Wghtd_2020 + Tot_Dsg_Unts_2020 + Tot_Dsg_Unts_2018 +
Tot_Mftr + Tot_Dsg_Unts_2019 + Avg_Spnd_Per_Dsg_Unt_Wghtd_2019 +
Tot_Spndng_2019 + Avg_Spnd_Per_Clm_2019 + Tot_Spndng_2018,
data = mlrData)
Residuals:
Min 1Q Median 3Q Max
-1.29306 -0.00919 -0.00116 0.00678 1.39686
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.027944 0.002733 10.226 < 2e-16 ***
Tot_Clms_2021 0.390305 0.005578 69.969 < 2e-16 ***
Tot_Spndng_2022 0.968748 0.002054 471.638 < 2e-16 ***
Avg_Spnd_Per_Clm_2022 -1.023689 0.001660 -616.573 < 2e-16 ***
Tot_Dsg_Unts_2022 0.021380 0.001961 10.900 < 2e-16 ***
Avg_Spnd_Per_Dsg_Unt_Wghtd_2022 0.086763 0.002213 39.203 < 2e-16 ***
Tot_Dsg_Unts_2021 -0.018260 0.002117 -8.626 < 2e-16 ***
Avg_Spnd_Per_Clm_2018 0.007007 0.001561 4.487 7.28e-06 ***
Tot_Clms_2019 -0.075123 0.011029 -6.811 1.02e-11 ***
Tot_Spndng_2021 -0.365477 0.005281 -69.211 < 2e-16 ***
Avg_Spnd_Per_Clm_2021 0.391886 0.005801 67.557 < 2e-16 ***
Avg_Spnd_Per_Dsg_Unt_Wghtd_2021 -0.044258 0.002467 -17.940 < 2e-16 ***
Chg_Avg_Spnd_Per_Dsg_Unt_21_22 -0.026432 0.001555 -17.001 < 2e-16 ***
Tot_Clms_2020 0.600235 0.007918 75.803 < 2e-16 ***
Tot_Spndng_2020 -0.576232 0.007721 -74.634 < 2e-16 ***
Avg_Spnd_Per_Clm_2020 0.618370 0.008433 73.329 < 2e-16 ***
CAGR_Avg_Spnd_Per_Dsg_Unt_18_22 -0.109748 0.007675 -14.300 < 2e-16 ***
Avg_Spnd_Per_Dsg_Unt_Wghtd_2020 -0.028563 0.002591 -11.026 < 2e-16 ***
Tot_Dsg_Unts_2020 -0.019831 0.002270 -8.737 < 2e-16 ***
Tot_Dsg_Unts_2018 0.030375 0.002253 13.480 < 2e-16 ***
Tot_Mftr 0.005607 0.001052 5.332 9.90e-08 ***
Tot_Dsg_Unts_2019 -0.010184 0.002384 -4.271 1.96e-05 ***
Avg_Spnd_Per_Dsg_Unt_Wghtd_2019 -0.008039 0.002430 -3.308 0.000943 ***
Tot_Spndng_2019 0.081093 0.010718 7.566 4.15e-14 ***
Avg_Spnd_Per_Clm_2019 -0.078954 0.011539 -6.842 8.20e-12 ***
Tot_Spndng_2018 -0.029368 0.002300 -12.768 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04357 on 11264 degrees of freedom
Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
F-statistic: 2.052e+06 on 25 and 11264 DF, p-value: < 2.2e-16
[1] "RMSE: 0.0460346572318553"
The final model's overall F-test has a p-value below 2.2e-16, so the model is statistically significant, and its R-squared of 0.9998 indicates that it explains 99.98% of the variance in log-transformed 2022 claims. The RMSE of 0.046, measured on the log scale, is also very low, indicating that the model's predictions were very close to the actual values in the testing set.
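Because the RMSE is measured on the log scale, it can be read as a multiplicative error on the original claim counts. The Python sketch below assumes the transformation was a natural log (the text says only "log transformation"), so the resulting percentage is illustrative rather than a reported result.

```python
import math

# RMSE reported for the testing set, on the log scale.
rmse_log = 0.0460346572318553

# Assumption: natural log was used. A log-scale error of r then corresponds
# to a multiplicative factor of exp(r) on the original claim counts.
factor = math.exp(rmse_log)
pct_error = (factor - 1) * 100

print(f"typical multiplicative error: x{factor:.4f} (about {pct_error:.1f}%)")
```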
The regression equation, with all variables on the log scale, is:
Tot_Clms_2022 = 0.390305(Tot_Clms_2021) + 0.968748(Tot_Spndng_2022) - 1.023689(Avg_Spnd_Per_Clm_2022) + 0.021380(Tot_Dsg_Unts_2022) + 0.086763(Avg_Spnd_Per_Dsg_Unt_Wghtd_2022) - 0.018260(Tot_Dsg_Unts_2021) + 0.007007(Avg_Spnd_Per_Clm_2018) - 0.075123(Tot_Clms_2019) - 0.365477(Tot_Spndng_2021) + 0.391886(Avg_Spnd_Per_Clm_2021) - 0.044258(Avg_Spnd_Per_Dsg_Unt_Wghtd_2021) - 0.026432(Chg_Avg_Spnd_Per_Dsg_Unt_21_22) + 0.600235(Tot_Clms_2020) - 0.576232(Tot_Spndng_2020) + 0.618370(Avg_Spnd_Per_Clm_2020) - 0.109748(CAGR_Avg_Spnd_Per_Dsg_Unt_18_22) - 0.028563(Avg_Spnd_Per_Dsg_Unt_Wghtd_2020) - 0.019831(Tot_Dsg_Unts_2020) + 0.030375(Tot_Dsg_Unts_2018) + 0.005607(Tot_Mftr) - 0.010184(Tot_Dsg_Unts_2019) - 0.008039(Avg_Spnd_Per_Dsg_Unt_Wghtd_2019) + 0.081093(Tot_Spndng_2019) - 0.078954(Avg_Spnd_Per_Clm_2019) - 0.029368(Tot_Spndng_2018) + 0.027944
All variables included in the model are statistically significant at alpha = 0.05. As a general trend from the model results, the most influential factors in predicting the number of claims in a future year are the previous years' claims, spending, and dosage units.
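The two largest coefficients, on Tot_Spndng_2022 (about +0.97) and Avg_Spnd_Per_Clm_2022 (about -1.02), are close to +1 and -1. One possible reading is that the log of claims is almost exactly the log of spending minus the log of average spending per claim. This assumes Avg_Spnd_Per_Clm is simply total spending divided by total claims, which is an assumption about how the dataset defines the field and is not stated above. The Python sketch below illustrates that identity with made-up numbers; it is not the fitted model.

```python
import math

# Hypothetical values for one drug (not taken from the dataset).
tot_spending_2022 = 1_250_000.0
tot_claims_2022 = 8_400.0

# Assumed definition: average spending per claim = spending / claims.
avg_spend_per_claim_2022 = tot_spending_2022 / tot_claims_2022

# On the log scale the ratio becomes a difference, i.e. coefficients +1 and -1.
log_claims = math.log(tot_claims_2022)
reconstructed = math.log(tot_spending_2022) - math.log(avg_spend_per_claim_2022)

print(log_claims, reconstructed)
```

If that definition holds, part of the model's fit may come from 2022 variables that already determine the 2022 claim count, which is worth keeping in mind when reading the R-squared.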
After applying the log transformation, the residuals vs. fitted plot is more evenly distributed. There are a few points with larger residual values, but the majority of residuals center around zero.
Although all marked outliers were removed, the residuals vs. leverage plot shows that observations 4584, 4585, and 9419 are all highly influential points.
The Predicted vs. Actual plot shows that the predicted and actual Medicaid claim count values of the MLR model are almost identical. This indicates that the model has high accuracy.
Research Question: Can we predict the total number of Medicaid claims for 2022 based on previous years?
Conclusion: Overall, the MLR analysis shows that Medicaid claim counts for 2022 can be predicted with high accuracy. The R-squared of our model is 0.9998 and the test-set RMSE is 0.046, indicating high model performance. From our analysis, the most important predictors are the previous years' claim counts, spending, and dosage units.
Research Question: Can we use the previous years' spending to predict the amount spent in 2022?
Conclusion: Our model has an R-squared of 0.9530939.
The best lambda for the regression was 12101125.
The following is a table of the coefficients of our model
Variables | Values |
---|---|
(Intercept) | -3.447282e+05 |
2018 Spending | -2.076970e-01 |
2019 Spending | 7.460894e-02 |
2020 Spending | 4.389263e-01 |
2021 Spending | 7.726332e-01 |
The Ridge Regression model is able to predict 2022 Medicaid spending from the spending in previous years. With an R-squared of 0.953, the model explains about 95.3% of the variance in 2022 spending. The coefficients help us interpret how each previous year relates to spending in 2022. Spending in 2021 has the largest coefficient (0.7726), followed by 2020 (0.4389), while 2019 has a much smaller coefficient (0.0746). The 2018 coefficient (-0.2077) is the only negative slope; the intercept is also negative.
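To make the coefficient table concrete, the Python sketch below applies it as a plain linear predictor. It assumes the coefficients are on the original dollar scale (not standardized) and uses a made-up drug with flat spending; the function name and inputs are illustrative only.

```python
# Ridge coefficients as reported in the coefficient table.
INTERCEPT = -3.447282e+05
COEFS = {
    2018: -2.076970e-01,
    2019: 7.460894e-02,
    2020: 4.389263e-01,
    2021: 7.726332e-01,
}

def predict_spending_2022(spending_by_year):
    """Linear prediction: intercept + sum(coefficient * prior-year spending)."""
    return INTERCEPT + sum(COEFS[year] * spending_by_year[year] for year in COEFS)

# Hypothetical drug with $1,000,000 of spending in each prior year.
example = {2018: 1e6, 2019: 1e6, 2020: 1e6, 2021: 1e6}
print(round(predict_spending_2022(example), 2))
```

Because the intercept is large and negative, drugs with very low prior spending would get a negative prediction under this reading, so the model is best suited to drugs with substantial spending.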
To investigate which variables are associated with one another, a correlation matrix is used. Based on the results, Total Claims and Total Dosage Units have the highest correlation, 0.853. The claims data represents the volume of prescriptions filled, while dosage units count the quantity of drug dispensed, measured in the lowest dispensable unit. This relationship can be investigated further by comparing trends for generic and brand name drugs.
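Each entry of a correlation matrix is a pairwise Pearson correlation. The Python sketch below computes one such entry by hand; the claim and dosage-unit values are invented for illustration and do not come from the Medicaid dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical totals for five drugs.
claims = [120, 450, 900, 1500, 3200]
dosage_units = [3500, 14000, 25000, 47000, 90000]

print(round(pearson(claims, dosage_units), 3))
```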
Degree | MSE_Generic | MSE_Brand |
---|---|---|
1 | 1.4010 | 3.1338 |
2 | 1.3955 | 3.1338 |
To determine which polynomial degree is best for a LOESS model, both the generic and brand name drug subsets of the Medicaid claims data were fit using degree=1 and degree=2. The table above shows the resulting MSE values of each fit. Degree=2 gives a lower MSE for the generic subset (1.3955 vs. 1.4010) and the same MSE to four decimal places for the brand subset (3.1338), so degree=2 is used.
Span | MSE_Generic | MSE_Brand |
---|---|---|
0.25 | 1.3807 | 3.1220 |
0.30 | 1.3871 | 3.1228 |
0.35 | 1.3900 | 3.1234 |
0.40 | 1.3917 | 3.1242 |
0.45 | 1.3919 | 3.1253 |
0.50 | 1.3921 | 3.1264 |
0.55 | 1.3923 | 3.1273 |
0.60 | 1.3929 | 3.1281 |
0.65 | 1.3935 | 3.1285 |
0.70 | 1.3939 | 3.1291 |
0.75 | 1.3941 | 3.1296 |
To determine the best span, each subset is fit using span values in the range 0.25-0.75 (as suggested for group exercise 2). The output shows that span=0.25 has the lowest MSE for both subsets, with span=0.30 the next lowest. Based on this, span=0.25 will be used for the LOESS fit.
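The selection step can be checked directly from the span table. The Python sketch below only picks the span with the smallest MSE per subset; it does not refit any LOESS model.

```python
# MSE values copied from the span table: span -> (MSE_Generic, MSE_Brand).
mse_by_span = {
    0.25: (1.3807, 3.1220),
    0.30: (1.3871, 3.1228),
    0.35: (1.3900, 3.1234),
    0.40: (1.3917, 3.1242),
    0.45: (1.3919, 3.1253),
    0.50: (1.3921, 3.1264),
    0.55: (1.3923, 3.1273),
    0.60: (1.3929, 3.1281),
    0.65: (1.3935, 3.1285),
    0.70: (1.3939, 3.1291),
    0.75: (1.3941, 3.1296),
}

# Pick the span that minimizes MSE for each subset.
best_generic = min(mse_by_span, key=lambda span: mse_by_span[span][0])
best_brand = min(mse_by_span, key=lambda span: mse_by_span[span][1])

print(best_generic, best_brand)
```

In this table the MSE rises steadily with span for both subsets, so the minimum sits at the edge of the searched range; smaller spans were not tested.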
Group | Span | Degree | Number_of_Observations | Equivalent_Number_of_Parameters | Residual_Standard_Error | Trace_of_Smoother_Matrix |
---|---|---|---|---|---|---|
Generic Drugs | 0.25 | 2 | 579 | 12.92304 | 1.191257 | 563.3539 |
Brand Drugs | 0.25 | 2 | 2728 | 12.87655 | 1.771994 | 2712.4172 |
Both the generic and brand name drug subsets were fit using LOESS with span=0.25 and degree=2. The table above shows that there are significantly more brand name drugs in the dataset (2728) than generic drugs (579). The generic drug subset has a lower residual standard error, indicating a slightly better fit compared to the brand name drug subset.
The graph above shows the relationship between dosage units and claims for just the generic drugs. As the total claims increase, so do the total dosage units. At lower claim counts, the data is slightly more spread out, and this variance decreases slightly at higher claim counts. One possible explanation is that higher demand generic medications may have more standardized prescribed dosage levels and quantities, whereas there might be more variation with less commonly used medications.
The degree 1 and degree 2 fits are very similar; however, the degree 2 LOESS model fits the data slightly better.
The second graph shows the dosage units and claims for the brand name drugs. The dataset contains far more instances of brand name drugs than generic, which is reflected in the points on the graph. Compared to the generic drugs, the data is more evenly distributed across all claim count levels. Although the degree 1 and 2 models are very similar, degree 2 LOESS seems to fit the data better.
Research Question: What is the relationship between dosage units and claims for both brand name and generic drugs?
Conclusion:
Both LOESS graphs show that an increase in claims is accompanied by an increase in dosage units for both generic and brand name drugs. Both drug subsets have approximately the same rate of growth. Comparing the two, the brand name drug data is more evenly distributed, while the generic data at higher claim counts has less spread.
Dosage units reflect the quantity of medication dispensed per prescription filled. The trends suggest that generic medications, especially those in higher demand, may be prescribed in bulk or at standardized doses. In comparison, brand name drugs may show more variance in prescriptions due to a wider range of dosage levels or prescription regulations.
Confusion Matrix and Statistics
Reference
Prediction TRUE FALSE
TRUE 178 112
FALSE 183 519
Accuracy : 0.7026
95% CI : (0.6731, 0.7309)
No Information Rate : 0.6361
P-Value [Acc > NIR] : 5.819e-06
Kappa : 0.3294
Mcnemar's Test P-Value : 4.590e-05
Sensitivity : 0.4931
Specificity : 0.8225
Pos Pred Value : 0.6138
Neg Pred Value : 0.7393
Prevalence : 0.3639
Detection Rate : 0.1794
Detection Prevalence : 0.2923
Balanced Accuracy : 0.6578
'Positive' Class : TRUE
Research Question: Can we use Medicaid spending and claims over 2018-2022 to determine whether a medication has multiple manufacturers?
The kNN classification model, developed to predict whether a medication has multiple manufacturers based on total spending and total claims from 2018 to 2022, achieved an accuracy of 70.26% on the testing set. This is better than the no-information rate of 63.61%, and the p-value for that comparison (5.819e-06) is near zero, which tells us the improvement over always guessing the majority class is statistically significant. The model's sensitivity of 49.31% shows that it is relatively weak at identifying medications with multiple manufacturers, while the specificity of 82.25% shows that it is much better at identifying medications with only one manufacturer. With this model we expect a misclassification rate of around 29.74%.
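The headline statistics follow directly from the four cell counts of the kNN confusion matrix. The Python sketch below recomputes them, treating TRUE (multiple manufacturers) as the positive class as in the output; the helper name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, fp, fn, tn):
    """Basic rates from a 2x2 confusion matrix (positive class = TRUE)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of actual TRUE found
        "specificity": tn / (tn + fp),   # share of actual FALSE found
        # Accuracy of always predicting the larger reference class.
        "no_information_rate": max(tp + fn, tn + fp) / total,
    }

# Rows are predictions, columns are reference values:
#   Prediction TRUE : 178 (ref TRUE), 112 (ref FALSE)
#   Prediction FALSE: 183 (ref TRUE), 519 (ref FALSE)
knn = binary_metrics(tp=178, fp=112, fn=183, tn=519)

for name, value in knn.items():
    print(f"{name}: {value:.4f}")
```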
Category | Prior | Train | Test |
---|---|---|---|
Brand | 0.8249 | 0.822 | 0.8318 |
Generic | 0.1751 | 0.178 | 0.1682 |
The output shows that roughly 82.5% of the data represents brand name drugs, while around 17.5% represents generic drugs. Although a more even split would be ideal for a Naive Bayes model, this is the data distribution available in the Medicaid claims dataset.
Proportion_Correct | Misclassification |
---|---|
0.8503 | 0.1631 |
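Overall, the Naive Bayes model of generic and brand name drugs has a misclassification rate of around 0.16 on the test set, which corresponds to an overall accuracy of about 0.84 (831 of 993 correct in the confusion matrix below). The 0.8503 figure in the table matches the share of generic drugs classified correctly (142 of 167) rather than the overall accuracy. This indicates reasonable accuracy, although the classification rate still has room to improve.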
Cell Contents
|-------------------------|
| N |
| N / Col Total |
|-------------------------|
Total Observations in Table: 993
| Actual
Predicted | Brand | Generic | Row Total |
-------------|-----------|-----------|-----------|
Brand | 689 | 25 | 714 |
| 0.834 | 0.150 | |
-------------|-----------|-----------|-----------|
Generic | 137 | 142 | 279 |
| 0.166 | 0.850 | |
-------------|-----------|-----------|-----------|
Column Total | 826 | 167 | 993 |
| 0.832 | 0.168 | |
-------------|-----------|-----------|-----------|
The confusion matrix above compares the predicted and actual drug type classifications on the test set using the Naive Bayes model. The model correctly classified 83.4% of brand name drugs and 85.0% of generic drugs. About 15.0% of the generic drugs were falsely classified as brand name, and 16.6% of brand name drugs were classified as generic. One explanation for the misclassifications is the imbalance in the dataset. The majority of the observations are brand name drugs, with only about 17.5% of the data representing generics. With more Medicaid claims data on generic drugs, model accuracy may increase.
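With imbalanced classes, overall accuracy and per-class rates can differ noticeably. The Python sketch below recomputes both from the cell counts of the Naive Bayes confusion matrix; the balanced accuracy at the end is an extra summary that the original output does not report.

```python
# Cell counts from the Naive Bayes confusion matrix: counts[predicted][actual].
counts = {
    "Brand":   {"Brand": 689, "Generic": 25},
    "Generic": {"Brand": 137, "Generic": 142},
}
classes = ["Brand", "Generic"]

total = sum(counts[p][a] for p in classes for a in classes)
correct = sum(counts[c][c] for c in classes)
overall_accuracy = correct / total

# Per-class rate: correctly predicted / all observations actually in that class.
per_class_rate = {
    a: counts[a][a] / sum(counts[p][a] for p in classes) for a in classes
}

# Balanced accuracy: mean of the per-class rates (not in the original output).
balanced_accuracy = sum(per_class_rate.values()) / len(classes)

print(f"overall accuracy: {overall_accuracy:.4f}")
print(f"per-class rates:  {per_class_rate}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```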
The ROC curve above shows the classification performance of our model on the testing dataset. The plotted line approaches the top left corner of the plot, indicating high sensitivity and high specificity. There is also a large area under the curve; however, we still see room for improvement in sensitivity and specificity. Again, this could be improved with more data on generic drugs.
Research Question: Can we predict if a drug is brand name or generic based on the Medicaid spending trends?
Conclusion: The analysis shows that the Naive Bayes model was able to correctly classify about 84% of drugs in the test set (83.4% of brand name drugs and 85.0% of generic drugs). Based on the current performance, the model is able to predict drug type from Medicaid spending trends with reasonable accuracy. With more observations on generic drugs added to the Medicaid claims dataset, the model accuracy could be significantly improved.
Analysis of Deviance Table
Model 1: Multiple_Mftr ~ 1
Model 2: Multiple_Mftr ~ Tot_Clms_2022 + Tot_Spndng_2022
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 3306 4314.1
2 3304 3632.8 2 681.38 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
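The p-value in the deviance table comes from comparing the drop in deviance against a chi-squared distribution whose degrees of freedom equal the number of added predictors. For 2 degrees of freedom the chi-squared upper tail has the closed form exp(-x/2), so the figure can be checked in Python without a statistics library. The shortcut is specific to df = 2.

```python
import math

# Values printed in the analysis of deviance table.
deviance_drop = 681.38   # null deviance minus model deviance (as printed)
df = 2                   # two predictors added: claims and spending in 2022

# Chi-squared upper-tail probability; closed form valid only for df == 2.
if df != 2:
    raise ValueError("closed form exp(-x/2) only holds for df = 2")
p_value = math.exp(-deviance_drop / 2)

print(f"likelihood-ratio test p-value: {p_value:.3e}")
```

The result is on the order of 1e-148, which agrees with the model p-value of 1.096465e-148 quoted in the conclusion, up to rounding of the printed deviance.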
Predictions 0 1
No 595 197
Yes 37 163
When we use the model to make predictions on the testing data with a threshold of 0.35, it classifies 595 of the true no's correctly and 37 incorrectly, and 163 of the true yes's correctly and 197 incorrectly. This gives an overall success rate of 76.4% (758 of 992), which is better than the 63.7% we would get by always predicting no. The gap between the two classes is large: the model is right on 94.1% of the true no's (595 of 632) but on only 45.3% of the true yes's (163 of 360). This is worth keeping in mind when using the model, since it is far more reliable at recognizing true no's than true yes's.
Research Question: Can total Medicaid spending and claims for drugs in 2022 help us predict whether a medication is made by multiple manufacturers?
Conclusion: Using a Logistic Regression model with a threshold of 0.35, we were able to determine whether a medication has multiple manufacturers with a success rate of 76.4% on our testing data (758 of 992 in the prediction table above). The model has the following coefficients:
Variables | Values |
---|---|
Intercept | -7.777877e-01 |
Total Claims in 2022 | 7.574508e-06 |
Total Spending 2022 | -2.548977e-08 |
log-odds(Multiple_Mftr) = -0.7777877 + 7.574508e-06x₁ - 2.548977e-08x₂, where x₁ is total claims in 2022 and x₂ is total spending in 2022.
The model has a misclassification error of 0.1814865 and a p-value of 1.096465e-148.
Since this model makes predictions with a fairly low misclassification error rate and has a p-value near zero, we can say with confidence that it can be used to predict whether a medication has multiple manufacturers based on the spending and claims data from 2022. The model is most likely only useful until 2023 data becomes available, since that will give more up-to-date information than the 2022 data used here.
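The reported coefficients can be turned into a prediction by converting the log-odds to a probability and comparing it with the 0.35 threshold. The Python sketch below does that under two assumptions: the coefficients apply to untransformed 2022 claims and dollars, and a probability at or above the threshold counts as "yes". The function names and the example inputs are hypothetical.

```python
import math

# Logistic regression coefficients as reported in the coefficient table.
INTERCEPT = -7.777877e-01
COEF_CLAIMS_2022 = 7.574508e-06
COEF_SPENDING_2022 = -2.548977e-08
THRESHOLD = 0.35

def prob_multiple_mftr(total_claims, total_spending):
    """Predicted probability that a drug has multiple manufacturers."""
    log_odds = (INTERCEPT
                + COEF_CLAIMS_2022 * total_claims
                + COEF_SPENDING_2022 * total_spending)
    return 1.0 / (1.0 + math.exp(-log_odds))

def predict_multiple_mftr(total_claims, total_spending, threshold=THRESHOLD):
    """Classify as 'yes' when the probability reaches the threshold (assumed >=)."""
    return prob_multiple_mftr(total_claims, total_spending) >= threshold

# Hypothetical drug: 100,000 claims and $1,000,000 of spending in 2022.
print(prob_multiple_mftr(100_000, 1_000_000))
print(predict_multiple_mftr(100_000, 1_000_000))
```

Under these assumptions a drug with no claims and no spending has a baseline probability just under the threshold, so the claim count has to be fairly large before the model predicts multiple manufacturers.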