Paper title
Scaling multi-species occupancy models to large citizen science datasets
Paper authors
Paper abstract
Citizen science datasets can be very large and promise to improve species distribution modelling, but detection is imperfect, which risks bias when fitting models. In particular, observers may not detect species that are actually present. Occupancy models can estimate and correct for this observation process, and multi-species occupancy models exploit similarities in the observation process across species, which can improve estimates for rare species. However, the computational methods currently used to fit these models do not scale to large datasets. We develop approximate Bayesian inference methods and use graphics processing units (GPUs) to scale multi-species occupancy models to very large citizen science datasets. We fit multi-species occupancy models to one month of data from the eBird project, consisting of 186,811 checklist records comprising 430 bird species. We evaluate the predictions on a spatially separated test set of 59,338 records, comparing two different inference methods, Markov chain Monte Carlo (MCMC) and variational inference (VI), with occupancy models fitted to each species separately using maximum likelihood. We fit models to the entire dataset using VI, and to up to 32,000 records using MCMC. VI fitted to the entire dataset performed best, outperforming single-species models on both AUC (90.4% compared to 88.7%) and log likelihood (-0.080 compared to -0.085). We also evaluate how well range maps predicted by the model agree with expert maps. We find that modelling the detection process greatly improves agreement, and that the resulting maps agree as closely with expert maps as maps estimated using high-quality survey data. Our results demonstrate that multi-species occupancy models are a compelling approach to modelling large citizen science datasets and that, once the observation process is taken into account, they can model species distributions accurately.
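To illustrate how an occupancy model separates presence from detection, the following is a minimal sketch of the log-likelihood for one site under a basic single-species occupancy model. It is not the paper's implementation: the function name `site_log_likelihood`, the parameter names `psi` and `p`, and the assumption of a constant detection probability across visits are all hypothetical choices made for illustration, and both probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def site_log_likelihood(psi, p, detections):
    """Log-likelihood of one site's detection history (illustrative sketch).

    psi        -- assumed probability that the site is occupied (0 < psi < 1)
    p          -- assumed per-visit detection probability given occupancy
                  (0 < p < 1), held constant across visits for simplicity
    detections -- list of 0/1 values, one per visit (1 = species reported)
    """
    # Probability of the observed history if the species is present:
    # each visit independently detects it with probability p.
    lik_present = psi
    for y in detections:
        lik_present *= p if y == 1 else (1.0 - p)

    # If the species is absent, only an all-zero history is possible.
    lik_absent = (1.0 - psi) if not any(detections) else 0.0

    return math.log(lik_present + lik_absent)


# A site with no detections is not treated as certainly unoccupied:
# the likelihood mixes "absent" with "present but missed on every visit".
no_detection = site_log_likelihood(psi=0.5, p=0.5, detections=[0, 0])
one_detection = site_log_likelihood(psi=0.5, p=0.5, detections=[1, 0])
```

In this sketch a history with no detections still contributes probability mass to the "present but undetected" case, which is the correction for imperfect detection that the abstract describes. A multi-species model would additionally share information about the observation process across species, which this single-species sketch does not attempt.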