11 References

Almond, R. G., Mislevy, R. J., Steinberg, L. S., Yan, D., & Williamson, D. M. (2015). Bayesian networks in educational assessment. Springer. https://doi.org/10.1007/978-1-4939-2125-6
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Babu, G. J. (2011). Resampling methods for model fitting and model selection. Journal of Biopharmaceutical Statistics, 21(6), 1177–1186. https://doi.org/10.1080/10543406.2011.607749
Betancourt, M. (2018). A conceptual introduction to Hamiltonian Monte Carlo. arXiv. https://arxiv.org/abs/1701.02434
Bradshaw, L. (2016). Diagnostic classification models. In A. A. Rupp & J. Leighton (Eds.), The handbook of cognition and assessment: Frameworks, methodologies, and applications (1st ed., pp. 297–327). John Wiley & Sons. https://doi.org/10.1002/9781118956588.ch13
Bradshaw, L., Izsák, A., Templin, J., & Jacobson, E. (2014). Diagnosing teachers’ understandings of rational numbers: Building a multidimensional test within the diagnostic classification framework. Educational Measurement: Issues and Practice, 33(1), 2–14. https://doi.org/10.1111/emip.12020
Bradshaw, L., & Levy, R. (2019). Interpreting probabilistic classifications from diagnostic psychometric models. Educational Measurement: Issues and Practice, 38(2), 79–88. https://doi.org/10.1111/emip.12247
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items (Vol. 4). SAGE Publications.
Carlin, B. P., & Louis, T. A. (2001). Empirical Bayes: Past, present and future. In A. E. Raftery, M. A. Tanner, & M. T. Wells (Eds.), Statistics in the 21st century. Chapman and Hall/CRC. https://doi.org/10.1201/9781420035391
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32. https://doi.org/10.18637/jss.v076.i01
Casella, G. (1985). An introduction to empirical Bayes data analysis. The American Statistician, 39(2), 83–87. https://doi.org/10.2307/2682801
Chen, J., de la Torre, J., & Zhang, Z. (2013). Relative and absolute fit evaluation in cognitive diagnosis modeling. Journal of Educational Measurement, 50(2), 123–140. https://doi.org/10.1111/j.1745-3984.2012.00185.x
Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43, 551–558. https://doi.org/10.1016/0895-4356(90)90159-M
Clark, A. K., Karvonen, M., Swinburne Romine, R., & Kingston, N. (2018). Teacher use of score reports for instructional decision-making: Preliminary findings. National Council on Measurement in Education Annual Meeting, New York, NY. https://dynamiclearningmaps.org/sites/default/files/documents/presentations/NCME_2018_Score_Report_Use_Findings.pdf
Clark, A. K., Kobrin, J., & Hirt, A. (2022). Educator perspectives on instructionally embedded assessment (Research Synopsis No. 22-01). University of Kansas, Accessible Teaching, Learning, and Assessment Systems. https://dynamiclearningmaps.org/sites/default/files/documents/publication/IE_Focus_Groups_project_brief.pdf
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Routledge.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159. https://doi.org/10.1037/0033-2909.112.1.155
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. https://doi.org/10.1007/BF02310555
DLM Consortium. (2021). Test Administration Manual 2021–2022. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2017). 2015–2016 Technical Manual—Science. University of Kansas, Center for Educational Testing and Evaluation.
Dynamic Learning Maps Consortium. (2018a). 2016–2017 Technical Manual Update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2018b). 2017–2018 Technical Manual Update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2019). 2018–2019 Technical Manual Update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2020). 2019–2020 Technical Manual Update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2021a). 2020–2021 Technical Manual Update—Science. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2021b). Accessibility Manual 2021–2022. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2021c). Educator Portal User Guide. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2022a). 2021–2022 Technical Manual—Instructionally Embedded Model. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Dynamic Learning Maps Consortium. (2022b). 2021–2022 Technical Manual—Year-End Model. University of Kansas, Accessible Teaching, Learning, and Assessment Systems.
Efron, B. (2014). Two modeling strategies for empirical Bayes estimation. Statistical Science, 29(2), 285–301. https://doi.org/10.1214/13-STS455
Falconer, J. R., Frank, E., Polaschek, D. L. L., & Joshi, C. (2022). Methods for eliciting informative prior distributions: A critical review. Decision Analysis. https://doi.org/10.1287/deca.2022.0451
Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43, 543–549. https://doi.org/10.1016/0895-4356(90)90158-L
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255–282. https://doi.org/10.1007/BF02288892
Henson, R., & Douglas, J. (2005). Test construction for cognitive diagnosis. Applied Psychological Measurement, 29(4), 262–277. https://doi.org/10.1177/0146621604272623
Henson, R., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74(2), 191–210. https://doi.org/10.1007/s11336-008-9089-5
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70. http://www.jstor.org/stable/4615733
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329–349. https://doi.org/10.1207/S15324818AME1404_2
Johnson, M. S., & Sinharay, S. (2018). Measures of agreement to assess attribute-level classification accuracy and consistency for cognitive diagnostic assessments. Journal of Educational Measurement, 55(4), 635–664. https://doi.org/10.1111/jedm.12196
Karvonen, M., Bechard, S., & Wells-Moreaux, S. (2015). Accessibility considerations for students with significant cognitive disabilities who take computer-based alternate assessments. Paper presentation. American Educational Research Association Annual Meeting, Chicago, IL.
Karvonen, M., Wakeman, S. Y., Browder, D. M., Rogers, M. A. S., & Flowers, C. (2011). Academic curriculum for students with significant cognitive disabilities: Special education teacher perspectives a decade after IDEA 1997 [Research Report]. National Alternate Assessment Center. https://files.eric.ed.gov/fulltext/ED521407.pdf
Kobrin, J., Clark, A. K., & Kavitsky, E. (2022). Exploring educator perspectives on potential accessibility gaps in the Dynamic Learning Maps alternate assessment (Research Synopsis No. 22-02). University of Kansas, Accessible Teaching, Learning, and Assessment Systems. https://dynamiclearningmaps.org/sites/default/files/documents/publication/Accessibility_Focus_Groups_project_brief.pdf
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310
Leighton, J., & Gierl, M. (Eds.). (2007). Cognitive diagnostic assessment for education: Theory and applications. Cambridge University Press. https://doi.org/10.1017/CBO9780511611186
Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1), 503–528. https://doi.org/10.1007/BF01589116
Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64(2), 187–212. https://doi.org/10.1007/BF02294535
Mislevy, R. J., & Gitomer, D. H. (1995). The role of probability-based inference in an intelligent tutoring system. User Modeling and User-Adapted Interaction, 5(3–4), 253–282. https://doi.org/10.1007/BF01126112
Nabi, S., Nassif, H., Hong, J., Mamani, H., & Imbens, G. (2022). Bayesian meta-prior learning using empirical Bayes. Management Science, 68(3), 1737–1755. https://doi.org/10.1287/mnsc.2021.4136
National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. The National Academies Press.
Neal, R. (2011). MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, & X.-L. Meng (Eds.), Handbook of Markov Chain Monte Carlo (Vol. 20116022). Chapman and Hall/CRC. https://doi.org/10.1201/b10905-6
NGSS Lead States. (2013). Next Generation Science Standards: For States, by States. The National Academies Press.
Nitsch, C. (2013). Dynamic Learning Maps: The Arc parent focus groups. The Arc. https://dynamiclearningmaps.org/sites/default/files/documents/publication/TheArcParentFocusGroups.pdf
Nocedal, J., & Wright, S. J. (2006). Numerical optimization. Springer. https://doi.org/10.1007/978-0-387-40065-5
O’Leary, S., Lund, M., Ytre-Hauge, T. J., Holm, S. R., Naess, K., Dalland, L. N., & McPhail, S. M. (2014). Pitfalls in the use of kappa when interpreting agreement between multiple raters in reliability studies. Physiotherapy, 100, 27–35. https://doi.org/10.1016/j.physio.2013.08.002
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. Morgan Kaufmann. https://doi.org/10.1016/C2009-0-27609-4
Petrone, S., Rousseau, J., & Scricciolo, C. (2014). Bayes and empirical Bayes: Do they merge? Biometrika, 101(2), 285–302. https://doi.org/10.1093/biomet/ast067
Pontius, R. G., Jr., & Millones, M. (2011). Death to kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment. International Journal of Remote Sensing, 32, 4407–4429. https://doi.org/10.1080/01431161.2011.552923
Ravand, H., & Baghaei, P. (2020). Diagnostic classification models: Recent developments, practical issues, and prospects. International Journal of Testing, 20(1), 24–56. https://doi.org/10.1080/15305058.2019.1588278
Rupp, A. A., Templin, J., & Henson, R. (2010). Diagnostic measurement: Theory, methods, and applications. Guilford Press.
Sinharay, S., & Johnson, M. S. (2019). Measures of agreement: Reliability, classification accuracy, and classification consistency. In M. von Davier & Y.-S. Lee (Eds.), Handbook of Diagnostic Classification Models (pp. 359–377). Springer International Publishing. https://doi.org/10.1007/978-3-030-05584-4_17
Stan Development Team. (2022). RStan: The R interface to Stan. https://mc-stan.org/
Stefan, A. M., Evans, N. J., & Wagenmakers, E.-J. (2020). Practical challenges and methodological flexibility in prior elicitation. Psychological Methods, 27(2), 177–197. https://doi.org/10.1037/met0000354
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370. https://www.jstor.org/stable/1434855
Templin, J., & Bradshaw, L. (2013). Measuring the reliability of diagnostic classification model examinee estimates. Journal of Classification, 30(2), 251–275. https://doi.org/10.1007/s00357-013-9129-4
Templin, J., & Bradshaw, L. (2014). Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79(2), 317–339. https://doi.org/10.1007/s11336-013-9362-0
Templin, J., & Henson, R. (2008). Understanding the impact of skill acquisition: Relating diagnostic assessments to measurable outcomes. Paper presentation. American Educational Research Association Annual Meeting, New York, NY.
Thompson, W. J. (2019). Bayesian psychometrics for diagnostic assessments: A proof of concept (Research Report No. 19-01). University of Kansas, Accessible Teaching, Learning, and Assessment Systems. https://doi.org/10.35542/osf.io/jzqs8
Thompson, W. J. (2020). Reliability for the Dynamic Learning Maps assessments: A comparison of methods (Technical Report No. 20-03). University of Kansas, Accessible Teaching, Learning, and Assessment Systems. https://dynamiclearningmaps.org/sites/default/files/documents/publication/Reliability_Comparison.pdf
Thompson, W. J., Clark, A. K., & Nash, B. (2019). Measuring the reliability of diagnostic mastery classifications at multiple levels of reporting. Applied Measurement in Education, 32(4), 298–309. https://doi.org/10.1080/08957347.2019.1660345
Thompson, W. J., & Nash, B. (2019). Beyond learning progressions: Maps as assessment architecture: Illustrations and results. Symposium. National Council on Measurement in Education Annual Meeting, Toronto, Canada. https://dynamiclearningmaps.org/sites/default/files/documents/presentations/Thompson_Nash_Empirical_evaluation_of_learning_maps.pdf
Thompson, W. J., & Nash, B. (2022). A diagnostic framework for the empirical evaluation of learning maps. Frontiers in Education, 6, 714736. https://doi.org/10.3389/feduc.2021.714736
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021). Rank-normalization, folding, and localization: An improved \(\widehat{R}\) for assessing convergence of MCMC. Bayesian Analysis, 16(2), 667–718. https://doi.org/10.1214/20-BA1221
Wang, W., Song, L., Chen, P., Meng, Y., & Ding, S. (2015). Attribute-level and pattern-level classification consistency and accuracy indices for cognitive diagnostic assessment. Journal of Educational Measurement, 52(4), 457–476. https://doi.org/10.1111/jedm.12096
Zumbo, B. D., & Thomas, D. R. (1997). A measure of effect size for a model-based approach for studying DIF [Working paper]. University of Northern British Columbia, Edgeworth Laboratory for Quantitative Behavioral Science.