11 References

Almond, R. G., Mislevy, R. J., Steinberg, L. S., Yan, D., & Williamson, D. M. (2015). Bayesian networks in educational assessment. Springer. https://doi.org/10.1007/978-1-4939-2125-6
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Babu, G. J. (2011). Resampling methods for model fitting and model selection. Journal of Biopharmaceutical Statistics, 21(6), 1177–1186. https://doi.org/10.1080/10543406.2011.607749
Betancourt, M. (2018). A conceptual introduction to Hamiltonian Monte Carlo. arXiv. https://arxiv.org/abs/1701.02434
Bradshaw, L. (2016). Diagnostic classification models. In A. A. Rupp & J. Leighton (Eds.), The handbook of cognition and assessment: Frameworks, methodologies, and applications (1st ed., pp. 297–327). John Wiley & Sons. https://doi.org/10.1002/9781118956588.ch13
Bradshaw, L., Izsák, A., Templin, J., & Jacobson, E. (2014). Diagnosing teachers’ understandings of rational numbers: Building a multidimensional test within the diagnostic classification framework. Educational Measurement: Issues and Practice, 33(1), 2–14. https://doi.org/10.1111/emip.12020
Bradshaw, L., & Levy, R. (2019). Interpreting probabilistic classifications from diagnostic psychometric models. Educational Measurement: Issues and Practice, 38(2), 79–88. https://doi.org/10.1111/emip.12247
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items (Vol. 4). SAGE Publications.
Carlin, B. P., & Louis, T. A. (2001). Empirical Bayes: Past, present and future. In A. E. Raftery, M. A. Tanner, & M. T. Wells (Eds.), Statistics in the 21st century. Chapman and Hall/CRC. https://doi.org/10.1201/9781420035391
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32. https://doi.org/10.18637/jss.v076.i01
Casella, G. (1985). An introduction to empirical Bayes data analysis. The American Statistician, 39(2), 83–87. https://doi.org/10.2307/2682801
Chen, J., de la Torre, J., & Zhang, Z. (2013). Relative and absolute fit evaluation in cognitive diagnosis modeling. Journal of Educational Measurement, 50(2), 123–140. https://doi.org/10.1111/j.1745-3984.2012.00185.x
Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43, 551–558. https://doi.org/10.1016/0895-4356(90)90159-M
Clark, A. K., Karvonen, M., Swinburne Romine, R., & Kingston, N. (2018). Teacher use of score reports for instructional decision-making: Preliminary findings. National Council on Measurement in Education Annual Meeting, New York, NY. https://dynamiclearningmaps.org/sites/default/files/documents/presentations/NCME_2018_Score_Report_Use_Findings.pdf
Clark, A. K., Kobrin, J., & Hirt, A. (2022). Educator perspectives on instructionally embedded assessment (Research Synopsis No. 22-01). University of Kansas, Accessible Teaching, Learning, and Assessment Systems. https://dynamiclearningmaps.org/sites/default/files/documents/publication/IE_Focus_Groups_project_brief.pdf
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Routledge.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159. https://doi.org/10.1037/0033-2909.112.1.155
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. https://doi.org/10.1007/BF02310555
DLM Consortium. (2021). Test Administration Manual 2021–2022. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2017). 2015–2016 Technical Manual—Science. University of Kansas, Center for Educational Testing and Evaluation.
Dynamic Learning Maps Consortium. (2018a). 2016–2017 Technical Manual Update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2018b). 2017–2018 Technical Manual Update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2019). 2018–2019 Technical Manual Update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2020). 2019–2020 Technical Manual Update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2021a). 2020–2021 Technical Manual Update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2021b). Accessibility Manual 2021–2022. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2021c). Educator Portal User Guide. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2022a). 2021–2022 Technical Manual—Instructionally Embedded Model. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2022b). 2021–2022 Technical Manual—Year-End Model. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Efron, B. (2014). Two modeling strategies for empirical Bayes estimation. Statistical Science, 29(2), 285–301. https://doi.org/10.1214/13-STS455
Falconer, J. R., Frank, E., Polaschek, D. L. L., & Joshi, C. (2022). Methods for eliciting informative prior distributions: A critical review. Decision Analysis. https://doi.org/10.1287/deca.2022.0451
Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43, 543–549. https://doi.org/10.1016/0895-4356(90)90158-L
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255–282. https://doi.org/10.1007/BF02288892
Henson, R., & Douglas, J. (2005). Test construction for cognitive diagnosis. Applied Psychological Measurement, 29(4), 262–277. https://doi.org/10.1177/0146621604272623
Henson, R., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74(2), 191–210. https://doi.org/10.1007/s11336-008-9089-5
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70. http://www.jstor.org/stable/4615733
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329–349. https://doi.org/10.1207/S15324818AME1404_2
Johnson, M. S., & Sinharay, S. (2018). Measures of agreement to assess attribute-level classification accuracy and consistency for cognitive diagnostic assessments. Journal of Educational Measurement, 55(4), 635–664. https://doi.org/10.1111/jedm.12196
Karvonen, M., Bechard, S., & Wells-Moreaux, S. (2015). Accessibility considerations for students with significant cognitive disabilities who take computer-based alternate assessments. Paper presentation. American Educational Research Association Annual Meeting, Chicago, IL.
Karvonen, M., Wakeman, S. Y., Browder, D. M., Rogers, M. A. S., & Flowers, C. (2011). Academic curriculum for students with significant cognitive disabilities: Special education teacher perspectives a decade after IDEA 1997 [Research Report]. National Alternate Assessment Center. https://files.eric.ed.gov/fulltext/ED521407.pdf
Kobrin, J., Clark, A. K., & Kavitsky, E. (2022). Exploring educator perspectives on potential accessibility gaps in the Dynamic Learning Maps alternate assessment (Research Synopsis No. 22-02). University of Kansas, Accessible Teaching, Learning, and Assessment Systems. https://dynamiclearningmaps.org/sites/default/files/documents/publication/Accessibility_Focus_Groups_project_brief.pdf
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310
Leighton, J., & Gierl, M. (Eds.). (2007). Cognitive diagnostic assessment for education: Theory and applications. Cambridge University Press. https://doi.org/10.1017/CBO9780511611186
Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1), 503–528. https://doi.org/10.1007/BF01589116
Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64(2), 187–212. https://doi.org/10.1007/BF02294535
Mislevy, R. J., & Gitomer, D. H. (1995). The role of probability-based inference in an intelligent tutoring system. User Modeling and User-Adapted Interaction, 5(3–4), 253–282. https://doi.org/10.1007/BF01126112
Nabi, S., Nassif, H., Hong, J., Mamani, H., & Imbens, G. (2022). Bayesian meta-prior learning using empirical Bayes. Management Science, 68(3), 1737–1755. https://doi.org/10.1287/mnsc.2021.4136
National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. The National Academies Press.
Neal, R. (2011). MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, & X.-L. Meng (Eds.), Handbook of Markov Chain Monte Carlo (Vol. 20116022). Chapman and Hall/CRC. https://doi.org/10.1201/b10905-6
NGSS Lead States. (2013). Next Generation Science Standards: For States, by States. The National Academies Press.
Nitsch, C. (2013). Dynamic Learning Maps: The Arc parent focus groups. The Arc. https://dynamiclearningmaps.org/sites/default/files/documents/publication/TheArcParentFocusGroups.pdf
Nocedal, J., & Wright, S. J. (2006). Numerical optimization. Springer. https://doi.org/10.1007/978-0-387-40065-5
O’Leary, S., Lund, M., Ytre-Hauge, T. J., Holm, S. R., Naess, K., Dalland, L. N., & McPhail, S. M. (2014). Pitfalls in the use of kappa when interpreting agreement between multiple raters in reliability studies. Physiotherapy, 100, 27–35. https://doi.org/10.1016/j.physio.2013.08.002
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. Morgan Kaufmann. https://doi.org/10.1016/C2009-0-27609-4
Petrone, S., Rousseau, J., & Scricciolo, C. (2014). Bayes and empirical Bayes: Do they merge? Biometrika, 101(2), 285–302. https://doi.org/10.1093/biomet/ast067
Pontius, R. G., Jr., & Millones, M. (2011). Death to kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment. International Journal of Remote Sensing, 32, 4407–4429. https://doi.org/10.1080/01431161.2011.552923
Ravand, H., & Baghaei, P. (2020). Diagnostic classification models: Recent developments, practical issues, and prospects. International Journal of Testing, 20(1), 24–56. https://doi.org/10.1080/15305058.2019.1588278
Rupp, A. A., Templin, J., & Henson, R. (2010). Diagnostic measurement: Theory, methods, and applications. Guilford Press.
Sinharay, S., & Johnson, M. S. (2019). Measures of agreement: Reliability, classification accuracy, and classification consistency. In M. von Davier & Y.-S. Lee (Eds.), Handbook of Diagnostic Classification Models (pp. 359–377). Springer International Publishing. https://doi.org/10.1007/978-3-030-05584-4_17
Stan Development Team. (2022). RStan: The R interface to Stan. https://mc-stan.org/
Stefan, A. M., Evans, N. J., & Wagenmakers, E.-J. (2020). Practical challenges and methodological flexibility in prior elicitation. Psychological Methods, 27(2), 177–197. https://doi.org/10.1037/met0000354
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370. https://www.jstor.org/stable/1434855
Templin, J., & Bradshaw, L. (2013). Measuring the reliability of diagnostic classification model examinee estimates. Journal of Classification, 30(2), 251–275. https://doi.org/10.1007/s00357-013-9129-4
Templin, J., & Bradshaw, L. (2014). Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79(2), 317–339. https://doi.org/10.1007/s11336-013-9362-0
Templin, J., & Henson, R. (2008). Understanding the impact of skill acquisition: Relating diagnostic assessments to measurable outcomes. Paper presentation. American Educational Research Association Annual Meeting, New York, NY.
Thompson, W. J. (2019). Bayesian psychometrics for diagnostic assessments: A proof of concept (Research Report No. 19-01). University of Kansas, Accessible Teaching, Learning, and Assessment Systems. https://doi.org/10.35542/osf.io/jzqs8
Thompson, W. J. (2020). Reliability for the Dynamic Learning Maps assessments: A comparison of methods (Technical Report No. 20-03). University of Kansas, Accessible Teaching, Learning, and Assessment Systems. https://dynamiclearningmaps.org/sites/default/files/documents/publication/Reliability_Comparison.pdf
Thompson, W. J., Clark, A. K., & Nash, B. (2019). Measuring the reliability of diagnostic mastery classifications at multiple levels of reporting. Applied Measurement in Education, 32(4), 298–309. https://doi.org/10.1080/08957347.2019.1660345
Thompson, W. J., & Nash, B. (2019). Beyond learning progressions: Maps as assessment architecture: Illustrations and results. Symposium. National Council on Measurement in Education Annual Meeting, Toronto, Canada. https://dynamiclearningmaps.org/sites/default/files/documents/presentations/Thompson_Nash_Empirical_evaluation_of_learning_maps.pdf
Thompson, W. J., & Nash, B. (2022). A diagnostic framework for the empirical evaluation of learning maps. Frontiers in Education, 6, 714736. https://doi.org/10.3389/feduc.2021.714736
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021). Rank-normalization, folding, and localization: An improved \(\widehat{R}\) for assessing convergence of MCMC. Bayesian Analysis, 16(2), 667–718. https://doi.org/10.1214/20-BA1221
Wang, W., Song, L., Chen, P., Meng, Y., & Ding, S. (2015). Attribute-level and pattern-level classification consistency and accuracy indices for cognitive diagnostic assessment. Journal of Educational Measurement, 52(4), 457–476. https://doi.org/10.1111/jedm.12096
Zumbo, B. D., & Thomas, D. R. (1997). A measure of effect size for a model-based approach for studying DIF [Working paper]. University of Northern British Columbia, Edgeworth Laboratory for Quantitative Behavioral Science.