Introduction

Medicaid Spending by Drug 2018 to 2022

The following statement is from the Centers for Medicare & Medicaid Services (CMS), which published this data:

“The Medicaid by Drug dataset presents information on spending for covered outpatient drugs prescribed to beneficiaries enrolled in Medicaid by physicians and other healthcare professionals.

The dataset focuses on average spending per dosage unit and change in average spending per dosage unit over time. Units refer to the drug unit in the lowest dispensable amount. It also includes spending information for manufacturer(s) of the drugs as well as consumer-friendly information of drug uses and clinical indications.

Drug spending metrics for Medicaid represent the total amount reimbursed by both Medicaid and non-Medicaid entities to pharmacies for the drug. Medicaid drug spending contains both the Federal and State reimbursement and is inclusive of any applicable dispensing fees. In addition, this total is not reduced or affected by Medicaid rebates paid to the states.”

Data Dictionary

https://drive.google.com/drive/folders/1Y9TCKXrM5Ejq30c_S12gf5abQZnAzCS_?usp=sharing

Brnd_Name: Brand name of the drug.
Gnrc_Name: Generic name of the drug.
Tot_Mftr: Total number of manufacturers for the drug.
Mftr_Name: Name of the manufacturer.
Tot_Spndng_2018: Total spending on the drug in 2018.
Tot_Dsg_Unts_2018: Total dosage units distributed in 2018.
Tot_Clms_2018: Total claims made for the drug in 2018.
Tot_Spndng_2019: Total spending on the drug in 2019.
Tot_Dsg_Unts_2019: Total dosage units distributed in 2019.
Tot_Clms_2019: Total claims made for the drug in 2019.
Tot_Spndng_2020: Total spending on the drug in 2020.
Tot_Dsg_Unts_2020: Total dosage units distributed in 2020.
Tot_Clms_2020: Total claims made for the drug in 2020.
Tot_Spndng_2021: Total spending on the drug in 2021.
Tot_Dsg_Unts_2021: Total dosage units distributed in 2021.
Tot_Clms_2021: Total claims made for the drug in 2021.
Tot_Spndng_2022: Total spending on the drug in 2022.
Tot_Dsg_Unts_2022: Total dosage units distributed in 2022.
Tot_Clms_2022: Total claims made for the drug in 2022.
Chg_Avg_Spnd_Per_Dsg_Unt_21_22: Change in average spending per dosage unit from 2021 to 2022.
CAGR_Avg_Spnd_Per_Dsg_Unt_18_22: Compound annual growth rate for average spending per dosage unit from 2018 to 2022.
Tot_Spndng_ALL_YEARS: Total spending on the drug across all years.
Tot_Clms_ALL_YEARS: Total claims made for the drug across all years.
Tot_Dsg_Unts_ALL_YEARS: Total dosage units distributed across all years.
Multiple_Mftr: Indicator for drugs with multiple manufacturers.
US_Based: Indicator if the manufacturer is based in the US.
Drug_Type: Classification of the drug type (e.g., generic or brand name)

MLR

Model Selection

                    (Intercept)                   Tot_Clms_2021 
                    0.027944457                     0.390304766 
                Tot_Spndng_2022           Avg_Spnd_Per_Clm_2022 
                    0.968748164                    -1.023689024 
              Tot_Dsg_Unts_2022 Avg_Spnd_Per_Dsg_Unt_Wghtd_2022 
                    0.021380416                     0.086762779 
              Tot_Dsg_Unts_2021           Avg_Spnd_Per_Clm_2018 
                   -0.018259898                     0.007006525 
                  Tot_Clms_2019                 Tot_Spndng_2021 
                   -0.075123286                    -0.365476981 
          Avg_Spnd_Per_Clm_2021 Avg_Spnd_Per_Dsg_Unt_Wghtd_2021 
                    0.391885522                    -0.044258026 
 Chg_Avg_Spnd_Per_Dsg_Unt_21_22                   Tot_Clms_2020 
                   -0.026432371                     0.600234868 
                Tot_Spndng_2020           Avg_Spnd_Per_Clm_2020 
                   -0.576232149                     0.618370253 
CAGR_Avg_Spnd_Per_Dsg_Unt_18_22 Avg_Spnd_Per_Dsg_Unt_Wghtd_2020 
                   -0.109747671                    -0.028563344 
              Tot_Dsg_Unts_2020               Tot_Dsg_Unts_2018 
                   -0.019830849                     0.030375072 
                       Tot_Mftr               Tot_Dsg_Unts_2019 
                    0.005607305                    -0.010183740 
Avg_Spnd_Per_Dsg_Unt_Wghtd_2019                 Tot_Spndng_2019 
                   -0.008038990                     0.081092617 
          Avg_Spnd_Per_Clm_2019                 Tot_Spndng_2018 
                   -0.078954261                    -0.029368270

The multiple linear regression is fitted on a log transformation of all variables to address non-normality and non-constant variance. To identify the best predictors of the 2022 claim count, stepwise regression is used. This process starts from the null model and adds or removes variables until it reaches the combination that best fits the data. The selected model, shown above, includes total claims, total spending, dosage units, and the average spending measures from 2018 through 2022, along with the number of manufacturers, the 2021 to 2022 change in average spending per dosage unit, and the 2018 to 2022 compound annual growth rate.

Final Model


Call:
lm(formula = Tot_Clms_2022 ~ Tot_Clms_2021 + Tot_Spndng_2022 + 
    Avg_Spnd_Per_Clm_2022 + Tot_Dsg_Unts_2022 + Avg_Spnd_Per_Dsg_Unt_Wghtd_2022 + 
    Tot_Dsg_Unts_2021 + Avg_Spnd_Per_Clm_2018 + Tot_Clms_2019 + 
    Tot_Spndng_2021 + Avg_Spnd_Per_Clm_2021 + Avg_Spnd_Per_Dsg_Unt_Wghtd_2021 + 
    Chg_Avg_Spnd_Per_Dsg_Unt_21_22 + Tot_Clms_2020 + Tot_Spndng_2020 + 
    Avg_Spnd_Per_Clm_2020 + CAGR_Avg_Spnd_Per_Dsg_Unt_18_22 + 
    Avg_Spnd_Per_Dsg_Unt_Wghtd_2020 + Tot_Dsg_Unts_2020 + Tot_Dsg_Unts_2018 + 
    Tot_Mftr + Tot_Dsg_Unts_2019 + Avg_Spnd_Per_Dsg_Unt_Wghtd_2019 + 
    Tot_Spndng_2019 + Avg_Spnd_Per_Clm_2019 + Tot_Spndng_2018, 
    data = mlrData)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29306 -0.00919 -0.00116  0.00678  1.39686 

Coefficients:
                                 Estimate Std. Error  t value Pr(>|t|)    
(Intercept)                      0.027944   0.002733   10.226  < 2e-16 ***
Tot_Clms_2021                    0.390305   0.005578   69.969  < 2e-16 ***
Tot_Spndng_2022                  0.968748   0.002054  471.638  < 2e-16 ***
Avg_Spnd_Per_Clm_2022           -1.023689   0.001660 -616.573  < 2e-16 ***
Tot_Dsg_Unts_2022                0.021380   0.001961   10.900  < 2e-16 ***
Avg_Spnd_Per_Dsg_Unt_Wghtd_2022  0.086763   0.002213   39.203  < 2e-16 ***
Tot_Dsg_Unts_2021               -0.018260   0.002117   -8.626  < 2e-16 ***
Avg_Spnd_Per_Clm_2018            0.007007   0.001561    4.487 7.28e-06 ***
Tot_Clms_2019                   -0.075123   0.011029   -6.811 1.02e-11 ***
Tot_Spndng_2021                 -0.365477   0.005281  -69.211  < 2e-16 ***
Avg_Spnd_Per_Clm_2021            0.391886   0.005801   67.557  < 2e-16 ***
Avg_Spnd_Per_Dsg_Unt_Wghtd_2021 -0.044258   0.002467  -17.940  < 2e-16 ***
Chg_Avg_Spnd_Per_Dsg_Unt_21_22  -0.026432   0.001555  -17.001  < 2e-16 ***
Tot_Clms_2020                    0.600235   0.007918   75.803  < 2e-16 ***
Tot_Spndng_2020                 -0.576232   0.007721  -74.634  < 2e-16 ***
Avg_Spnd_Per_Clm_2020            0.618370   0.008433   73.329  < 2e-16 ***
CAGR_Avg_Spnd_Per_Dsg_Unt_18_22 -0.109748   0.007675  -14.300  < 2e-16 ***
Avg_Spnd_Per_Dsg_Unt_Wghtd_2020 -0.028563   0.002591  -11.026  < 2e-16 ***
Tot_Dsg_Unts_2020               -0.019831   0.002270   -8.737  < 2e-16 ***
Tot_Dsg_Unts_2018                0.030375   0.002253   13.480  < 2e-16 ***
Tot_Mftr                         0.005607   0.001052    5.332 9.90e-08 ***
Tot_Dsg_Unts_2019               -0.010184   0.002384   -4.271 1.96e-05 ***
Avg_Spnd_Per_Dsg_Unt_Wghtd_2019 -0.008039   0.002430   -3.308 0.000943 ***
Tot_Spndng_2019                  0.081093   0.010718    7.566 4.15e-14 ***
Avg_Spnd_Per_Clm_2019           -0.078954   0.011539   -6.842 8.20e-12 ***
Tot_Spndng_2018                 -0.029368   0.002300  -12.768  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.04357 on 11264 degrees of freedom
Multiple R-squared:  0.9998,    Adjusted R-squared:  0.9998 
F-statistic: 2.052e+06 on 25 and 11264 DF,  p-value: < 2.2e-16

[1] "RMSE:  0.0460346572318553"

The final model has an overall F-test p-value of less than 2.2e-16, which is statistically significant, and an R-squared of 0.9998, indicating that the model explains 99.98% of the variance in the log-transformed 2022 claim counts. The RMSE of 0.046 is also very low, indicating that the model's predictions were very close to the actual values in the testing set.
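For readers who want to see how these two summary figures are defined, here is a minimal Python sketch (not the R code behind this dashboard). The `actual` and `predicted` lists are hypothetical log-scale values chosen only for illustration.

```python
import math

def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean squared error, in the same units as `actual`."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical log-scale claim counts, for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.1, 3.9, 6.1, 7.9]

print(round(r_squared(actual, predicted), 4))
print(round(rmse(actual, predicted), 4))
```

Because the model is fitted on log-transformed variables, an RMSE computed this way is on the log scale, not in raw claim counts.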

The regression equation is:

Tot_Clms_2022 = 0.390305(Tot_Clms_2021) + 0.968748(Tot_Spndng_2022) - 1.023689(Avg_Spnd_Per_Clm_2022) + 0.021380(Tot_Dsg_Unts_2022) + 0.086763(Avg_Spnd_Per_Dsg_Unt_Wghtd_2022) - 0.018260(Tot_Dsg_Unts_2021) + 0.007007(Avg_Spnd_Per_Clm_2018) - 0.075123(Tot_Clms_2019) - 0.365477(Tot_Spndng_2021) + 0.391886(Avg_Spnd_Per_Clm_2021) - 0.044258(Avg_Spnd_Per_Dsg_Unt_Wghtd_2021) - 0.026432(Chg_Avg_Spnd_Per_Dsg_Unt_21_22) + 0.600235(Tot_Clms_2020) - 0.576232(Tot_Spndng_2020) + 0.618370(Avg_Spnd_Per_Clm_2020) - 0.109748(CAGR_Avg_Spnd_Per_Dsg_Unt_18_22) - 0.028563(Avg_Spnd_Per_Dsg_Unt_Wghtd_2020) - 0.019831(Tot_Dsg_Unts_2020) + 0.030375(Tot_Dsg_Unts_2018) + 0.005607(Tot_Mftr) - 0.010184(Tot_Dsg_Unts_2019) - 0.008039(Avg_Spnd_Per_Dsg_Unt_Wghtd_2019) + 0.081093(Tot_Spndng_2019) - 0.078954(Avg_Spnd_Per_Clm_2019) - 0.029368(Tot_Spndng_2018) + 0.027944

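All of the variables included in the model are statistically significant at alpha = 0.05. As a general trend from the model results, the most influential factors in predicting the number of claims are the previous years' claims, spending, and dosage units. By coefficient magnitude, the largest terms are 2022 average spending per claim (-1.024) and 2022 total spending (0.969), followed by 2020 average spending per claim (0.618) and 2020 claims (0.600).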
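Since both the response and the predictors are log-transformed, each coefficient can be read as an approximate elasticity. The Python sketch below illustrates that reading for the Tot_Clms_2021 coefficient. It assumes a plain log transform on both sides (the source does not say whether log or log1p was used) and holds all other predictors fixed; the helper name is hypothetical.

```python
def pct_change_in_outcome(coef, pct_change_in_predictor):
    """Percent change in the outcome implied by a log-log coefficient,
    holding every other predictor constant."""
    ratio = 1 + pct_change_in_predictor / 100
    return (ratio ** coef - 1) * 100

COEF_TOT_CLMS_2021 = 0.390305  # taken from the coefficient table

effect_1pct = pct_change_in_outcome(COEF_TOT_CLMS_2021, 1.0)
effect_10pct = pct_change_in_outcome(COEF_TOT_CLMS_2021, 10.0)

print(round(effect_1pct, 3))   # roughly 0.39 percent
print(round(effect_10pct, 2))  # roughly 3.8 percent
```

Under these assumptions, a 1% increase in 2021 claims goes with roughly a 0.39% increase in predicted 2022 claims.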

Residuals vs. Fitted

After applying the log transformation, the residuals vs. fitted plot is more evenly distributed. There are a few points with larger residual values, but the majority of residuals center around zero.

Normal Q-Q Plot

This normal Q-Q plot is the result of a log transformation of the dataset, which improved normality more than the other transformations tried. The plot shows that the majority of points at the center align with the red line. This indicates a mostly normal distribution; however, there are some deviations at either end of the graph, indicating there may still be some skewness.

Residuals vs. Leverage Plot

Although all marked outliers were removed, the residuals vs. leverage plot shows that observations 4584, 4585, and 9419 are all highly influential points.

Predicted vs. Actual Plot

The Predicted vs. Actual plot shows that the predicted and actual Medicaid claim count values of the MLR model are almost identical. This indicates that the model has high accuracy.

Text

Research Question: Can we predict the total number of Medicaid claims for 2022 based on data from previous years?

Conclusion: Overall, the MLR analysis shows that Medicaid claim counts for 2022 can be predicted with high accuracy. The R-squared of our model is 0.9998, and the RMSE is 0.046, indicating high model performance. From our analysis, the most important predictors are the previous years’ claim counts, spending, and dosage units.

Ridge Regression

Model

Coefficient Plot

Chart 3

Research Question: Can we use the amounts spent in previous years to predict the amount spent in 2022?

Conclusion: Our model has an R-squared of

[1] 0.9530939

The best lambda for the regression was

[1] 12101125

The following is a table of the coefficients of our model

Ridge Regression Coefficients
Variables	Values
(Intercept)	-3.447282e+05
2018 Spending	-2.076970e-01
2019 Spending	7.460894e-02
2020 Spending	4.389263e-01
2021 Spending	7.726332e-01

The Ridge Regression model is able to predict 2022 Medicaid spending based on the spending from previous years. With an R-squared of 0.953, the model explains about 95.3% of the variance in 2022 spending, so its predictions track the true values closely. The coefficients help us interpret how spending in previous years relates to spending in 2022. Spending in 2021 has the largest coefficient (0.7726), followed by 2020 (0.4389), while 2019 has a much smaller coefficient (0.0746). 2018 differs from the other years because it is the only one with a negative coefficient (-0.2077).
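To make the coefficient table concrete, the following Python sketch applies the reported coefficients to one drug. It assumes the coefficients are on the original dollar scale, which the size of the intercept suggests but the source does not state. The function name and the example spending figures are hypothetical.

```python
# Coefficients copied from the Ridge Regression Coefficients table.
INTERCEPT = -3.447282e+05
COEFS = {
    2018: -2.076970e-01,
    2019: 7.460894e-02,
    2020: 4.389263e-01,
    2021: 7.726332e-01,
}

def predict_2022_spending(spending_by_year):
    """Linear prediction: intercept plus coefficient times spending for each prior year."""
    return INTERCEPT + sum(COEFS[year] * spending_by_year[year] for year in COEFS)

# Hypothetical drug with 10 million dollars of spending in each prior year.
example = {2018: 1e7, 2019: 1e7, 2020: 1e7, 2021: 1e7}
print(round(predict_2022_spending(example), 1))
```

With flat spending across all four years, the four slope coefficients sum to about 1.078, so the prediction lands slightly above the prior-year level once the negative intercept is subtracted.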

LOESS

Correlation Matrix

Correlation Matrix Text

In order to investigate which variables are associated with one another, a correlation matrix is used. Based on the results, Total Claims and Total Dosage Units have the highest correlation, 0.853. The claims data represent the volume of prescriptions filled, while the dosage units represent the quantity of medication dispensed, counted in the lowest dispensable unit. This relationship can be investigated further by comparing trends for both generic and brand name drugs.
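Each entry in a correlation matrix is a pairwise correlation coefficient. The Python sketch below computes the Pearson coefficient, on the assumption that the matrix uses Pearson's r (the usual default, though the source does not say). The `claims` and `dosage_units` values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for five drugs, for illustration only.
claims = [10.0, 20.0, 30.0, 40.0, 50.0]
dosage_units = [300.0, 610.0, 880.0, 1250.0, 1490.0]

r = pearson_r(claims, dosage_units)
print(round(r, 3))
```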

LOESS Degree Comparison

MSE Values for Degrees 1 and 2
Degree	MSE_Generic	MSE_Brand
1	1.4010	3.1338
2	1.3955	3.1338

In order to determine which polynomial degree is best for a LOESS model, both the generic and brand name drug subsets of the Medicaid claims data were fit using degree=1 and degree=2. The table above shows the resulting MSE values of each fit. Degree=2 fits the generic subset slightly better (MSE 1.3955 vs. 1.4010), while the brand name MSE is the same for both degrees to four decimal places (3.1338). Degree=2 is therefore used.
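As a rough illustration of what a LOESS fit does at each point, here is a simplified Python sketch of a degree-1 local regression with tricube weights. It is not the R `loess` call used for this dashboard: it omits degree 2 and robustness iterations, and it does not handle duplicated x values. The function names and data are hypothetical.

```python
import math

def loess_point(xs, ys, x0, span=0.75):
    """Degree-1 local regression estimate at x0 with tricube weights."""
    n = len(xs)
    k = max(3, math.ceil(span * n))  # neighbourhood size
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    d_max = max(abs(xs[i] - x0) for i in nearest) or 1.0
    # Tricube weights: close to 1 near x0, falling to 0 at the edge.
    w = [(1 - (abs(xs[i] - x0) / d_max) ** 3) ** 3 for i in nearest]
    sw = sum(w)
    mx = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
    my = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
    sxx = sum(wi * (xs[i] - mx) ** 2 for wi, i in zip(w, nearest))
    sxy = sum(wi * (xs[i] - mx) * (ys[i] - my) for wi, i in zip(w, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def loess_mse(xs, ys, span):
    """In-sample mean squared error of the local fit."""
    fitted = [loess_point(xs, ys, x, span) for x in xs]
    return sum((y - f) ** 2 for y, f in zip(ys, fitted)) / len(ys)

# Hypothetical, exactly linear data: a local linear fit should reproduce it.
xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]
print(loess_point(xs, ys, 4.5, span=0.5))
```

Comparing `loess_mse` across settings mirrors the comparison in the table, although how the dashboard computed its MSE values (in-sample or held out) is not stated.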

LOESS Span Comparison

MSE Values for Generic and Brand Drugs
Span	MSE_Generic	MSE_Brand
0.25	1.3807	3.1220
0.30	1.3871	3.1228
0.35	1.3900	3.1234
0.40	1.3917	3.1242
0.45	1.3919	3.1253
0.50	1.3921	3.1264
0.55	1.3923	3.1273
0.60	1.3929	3.1281
0.65	1.3935	3.1285
0.70	1.3939	3.1291
0.75	1.3941	3.1296

In order to determine the best span, both drug subsets are fit using span values in the range 0.25-0.75 (as suggested for group exercise 2). The output shows that span=0.25 has the lowest MSE for both generic (1.3807) and brand name (3.1220) drugs, with span=0.30 a close second. Based on this, span=0.25 will be used for the LOESS fit.
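The selection rule can be checked directly against the table values. This short Python sketch picks the span with the smallest MSE for each subset; the dictionary simply copies the table above, and the variable names are illustrative.

```python
# span -> (MSE_Generic, MSE_Brand), copied from the span comparison table.
mse_by_span = {
    0.25: (1.3807, 3.1220),
    0.30: (1.3871, 3.1228),
    0.35: (1.3900, 3.1234),
    0.40: (1.3917, 3.1242),
    0.45: (1.3919, 3.1253),
    0.50: (1.3921, 3.1264),
    0.55: (1.3923, 3.1273),
    0.60: (1.3929, 3.1281),
    0.65: (1.3935, 3.1285),
    0.70: (1.3939, 3.1291),
    0.75: (1.3941, 3.1296),
}

best_generic = min(mse_by_span, key=lambda span: mse_by_span[span][0])
best_brand = min(mse_by_span, key=lambda span: mse_by_span[span][1])

print(best_generic, best_brand)
```

In this table the MSE rises steadily with span for both subsets, so the smallest span tested is selected.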

Generic and Brand Name Drug LOESS Fit

Summary of LOESS Fit for Generic and Brand Name Drugs
Group	Span	Degree	Number_of_Observations	Equivalent_Number_of_Parameters	Residual_Standard_Error	Trace_of_Smoother_Matrix
Generic Drugs	0.25	2	579	12.92304	1.191257	563.3539
Brand Drugs	0.25	2	2728	12.87655	1.771994	2712.4172

Both the generic and brand name drug subsets were fit using LOESS with span=0.25 and degree=2. The table above shows that there are far more brand name drugs in the dataset (2728) than generic drugs (579). The generic drug subset has a lower residual standard error, indicating a slightly better fit compared to the brand name drug subset.

Generic Drug LOESS Chart

The graph above shows the relationship between dosage units and claims for just the generic drugs. As the total claims increase, so do the total dosage units. At lower claim counts, the data is slightly more spread out, and this variance decreases slightly at higher claim counts. One possible explanation is that higher demand generic medications may have more standardized prescribed dosage levels and quantities, whereas there might be more variation with less commonly used medications.

The degree 1 and degree 2 fits are very similar; however, the degree 2 LOESS model fits the data slightly better.

Brand Name Drug LOESS Chart

The second graph shows the dosage units and claims for the brand name drugs. The dataset contains far more instances of brand name drugs than generic, which is reflected in the points on the graph. Compared to the generic drugs, the data is more evenly distributed across all claim count levels. Although the degree 1 and 2 models are very similar, degree 2 LOESS seems to fit the data better.

Text

Research Question: What is the relationship between dosage units and claims for both brand name and generic drugs?

Conclusion:

Both LOESS graphs show that an increase in claims is accompanied by an increase in dosage units for both generic and brand name drugs. Both drug subsets have approximately the same rate of growth. Comparing the two, the brand name drug data is more evenly distributed, while the generic data at higher claim counts has less spread.

Dosage units represent the quantity of medication dispensed, counted in the lowest dispensable unit. The trends suggest that generic medications, especially those in higher demand, may be prescribed in bulk or at highly standardized doses. In comparison, brand name drugs may have more variance in prescription due to a wider range of dosage levels or prescription regulations.

kNN

Plotted Data

Prediction Table

Confusion Matrix and Statistics

          Reference
Prediction TRUE FALSE
     TRUE   178   112
     FALSE  183   519
                                          
               Accuracy : 0.7026          
                 95% CI : (0.6731, 0.7309)
    No Information Rate : 0.6361          
    P-Value [Acc > NIR] : 5.819e-06       
                                          
                  Kappa : 0.3294          
                                          
 Mcnemar's Test P-Value : 4.590e-05       
                                          
            Sensitivity : 0.4931          
            Specificity : 0.8225          
         Pos Pred Value : 0.6138          
         Neg Pred Value : 0.7393          
             Prevalence : 0.3639          
         Detection Rate : 0.1794          
   Detection Prevalence : 0.2923          
      Balanced Accuracy : 0.6578          
                                          
       'Positive' Class : TRUE

Text

Research Question: Can we use Medicaid spending and claims over 2018-2022 to determine whether a medication has multiple manufacturers?

The kNN classification model, developed to predict whether a medication has multiple manufacturers based on total spending and total claims from 2018 to 2022, achieved an accuracy of 70.26% on the testing set. This is better than the no-information rate of 63.61%, and the p-value for that comparison (5.819e-06) is near zero, which tells us the improvement is statistically significant. The model’s sensitivity of 49.31% shows that it is weak at correctly identifying medications with multiple manufacturers, but the specificity of 82.25% shows that it is much better at identifying medications with only one manufacturer. With this model we expect a misclassification rate of around 29.74%.
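For readers unfamiliar with kNN, the Python sketch below shows the basic idea: a medication is labelled by majority vote among its k nearest neighbours in feature space. The data points, the two rescaled features, and k=3 are all hypothetical; the source does not report which k or scaling was used.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (scaled spending, scaled claims) pairs; True = multiple manufacturers.
points = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.30),
          (0.80, 0.90), (0.90, 0.70), (0.85, 0.95)]
labels = [False, False, False, True, True, True]

print(knn_predict(points, labels, (0.90, 0.80)))
print(knn_predict(points, labels, (0.10, 0.10)))
```

Because kNN relies on distances, features on very different scales (such as dollars and claim counts) are usually rescaled first, which is why the toy features here sit between 0 and 1.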

Naive Bayes

Proportion of Generic and Brand Name Drugs in Dataset

Brand vs. Generic Drugs in Prior, Training, and Testing Sets
Category	Prior	Train	Test
Brand	0.8249	0.822	0.8318
Generic	0.1751	0.178	0.1682

The output shows that roughly 82.5% of the data represents brand name drugs, while around 17.5% represents generic drugs. Although a more even split would be ideal for a Naive Bayes model, this is the data distribution available in the Medicaid claims dataset.

Overall Accuracy

Classification Accuracy
Proportion_Correct	Misclassification
0.8503	0.1631

Overall, the Naive Bayes model of generic and brand name drugs has a misclassification rate of around 0.16. The confusion matrix counts below give an overall accuracy of (689 + 142) / 993 ≈ 0.84; the 0.8503 reported above matches the share of generic drugs classified correctly (142 / 167). This indicates reasonable accuracy, although there is still room to improve the classification rate.

Confusion Matrix


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  993 

 
             | Actual 
   Predicted |     Brand |   Generic | Row Total | 
-------------|-----------|-----------|-----------|
       Brand |       689 |        25 |       714 | 
             |     0.834 |     0.150 |           | 
-------------|-----------|-----------|-----------|
     Generic |       137 |       142 |       279 | 
             |     0.166 |     0.850 |           | 
-------------|-----------|-----------|-----------|
Column Total |       826 |       167 |       993 | 
             |     0.832 |     0.168 |           | 
-------------|-----------|-----------|-----------|

The confusion matrix above compares the predicted and actual drug types for the test set using the Naive Bayes model. The model correctly classified 83.4% of brand name drugs and 85.0% of generic drugs. About 15% of the generic drugs were falsely classified as brand name, and 16.6% of brand name drugs were classified as generic. One explanation for the misclassifications is the imbalance in the dataset. The majority of the observations are brand name drugs, with only about 17.5% of the data representing generics. With more Medicaid claims data on generic drugs, model accuracy may increase.
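The per-class and overall rates can be recomputed from the four cell counts. A small Python sketch, with the counts copied from the confusion matrix above and variable names chosen for illustration:

```python
# Cell counts from the confusion matrix, keyed by (predicted, actual).
counts = {
    ("Brand", "Brand"): 689,
    ("Brand", "Generic"): 25,
    ("Generic", "Brand"): 137,
    ("Generic", "Generic"): 142,
}

total = sum(counts.values())
actual_brand = counts[("Brand", "Brand")] + counts[("Generic", "Brand")]
actual_generic = counts[("Brand", "Generic")] + counts[("Generic", "Generic")]

brand_correct_rate = counts[("Brand", "Brand")] / actual_brand
generic_correct_rate = counts[("Generic", "Generic")] / actual_generic
accuracy = (counts[("Brand", "Brand")] + counts[("Generic", "Generic")]) / total
misclassification = 1 - accuracy

print(round(brand_correct_rate, 3), round(generic_correct_rate, 3))
print(round(accuracy, 4), round(misclassification, 4))
```

The overall accuracy is a count-weighted average of the two per-class rates, so with brand name drugs making up most of the test set it sits closer to the brand name rate.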

ROC Curve

The ROC curve above shows the classification performance of our model on the testing dataset. The plotted line approaches the top left corner of the plot, indicating high sensitivity and high specificity. There is also a large area under the curve; however, we still see room for improvement in both sensitivity and specificity. Again, this could be improved with more data on generic drugs.

Research Question: Can we predict if a drug is brand name or generic based on the Medicaid spending trends?

Conclusion: The analysis shows that the Naive Bayes model was able to correctly classify roughly 84% of drugs overall (83.4% of brand name and 85.0% of generic drugs). Based on the current performance, the model is able to predict drug type from Medicaid spending trends with reasonable accuracy. With more observations on generic drugs added to the Medicaid claims dataset, the model accuracy could be improved further.

Logistic Regression

Data Plotted

ANOVA and Predictions

Analysis of Deviance Table

Model 1: Multiple_Mftr ~ 1
Model 2: Multiple_Mftr ~ Tot_Clms_2022 + Tot_Spndng_2022
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1      3306     4314.1                          
2      3304     3632.8  2   681.38 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

           
Predictions   0   1
        No  595 197
        Yes  37 163

When we use the model to make predictions on the testing data with a threshold of 0.35, the model predicts 595 of the true no’s correctly and 37 incorrectly, and 163 of the true yes’s correctly and 197 incorrectly. This gives an overall success rate of 76.4%, which is a clear improvement over making predictions without the model. The gap between the two classes is large: the model has a 94.1% success rate on the true no’s but only a 45.3% success rate on the true yes’s. This matters when using the model to make predictions, because it is far more reliable at recognizing medications with a single manufacturer than at recognizing those with multiple manufacturers.

Prediction Plot 1

Prediction Plot 2

Text

Research Question: Can total Medicaid spending and claims for drugs in 2022 help us predict whether a medication is made by multiple manufacturers?

Conclusion: Using a Logistic Regression model with a threshold of 0.35, we were able to determine whether a medication has multiple manufacturers with a success rate of about 76.4% on our testing data (758 of the 992 predictions in the prediction table are correct). The model has the following coefficients

Logistic Regression Coefficients
Variables	Values
Intercept	-7.777877e-01
Total Claims in 2022	7.574508e-06
Total Spending 2022	-2.548977e-08

log-odds(Multiple_Mftr) = -0.7777877 + 7.574508e-06x₁ - 2.548977e-08x₂, where x₁ is total claims in 2022 and x₂ is total spending in 2022

The model has a misclassification error of

[1] 0.1814865

and a p-value of

[1] 1.096465e-148

Since this model makes predictions with a low misclassification error rate and has a p-value near zero, we can confidently say that it can be used to predict whether a medication has multiple manufacturers based on the spending and claims data from 2022. This model is most likely only useful until data for 2023 become available, because that will give us more up-to-date information than the 2022 data used here.
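To show how the fitted coefficients turn into a yes/no prediction, the Python sketch below converts the linear predictor into a probability with the logistic function and applies the 0.35 threshold. It assumes claims and spending enter the model unscaled, as the small coefficient values suggest. The function names and example inputs are hypothetical.

```python
import math

# Coefficients copied from the Logistic Regression Coefficients table.
B0 = -7.777877e-01
B_CLAIMS = 7.574508e-06
B_SPENDING = -2.548977e-08

def prob_multiple_mftr(claims_2022, spending_2022):
    """Predicted probability that a drug has multiple manufacturers."""
    z = B0 + B_CLAIMS * claims_2022 + B_SPENDING * spending_2022
    # Numerically stable logistic function.
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    return math.exp(z) / (1 + math.exp(z))

def classify(claims_2022, spending_2022, threshold=0.35):
    """True when the predicted probability reaches the threshold."""
    return prob_multiple_mftr(claims_2022, spending_2022) >= threshold

# Hypothetical drug: 100,000 claims and 1,000,000 dollars of spending in 2022.
print(round(prob_multiple_mftr(100_000, 1_000_000), 3))
print(classify(100_000, 1_000_000))
```

With both inputs at zero, the predicted probability is about 0.31, just under the threshold, so under these assumptions a drug needs a sufficiently large claim volume relative to spending before the model predicts multiple manufacturers.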