
Commentary to: a cross-validation-based approach for delimiting reliable home range estimates

The Original Article was published on 06 September 2017

Abstract

Background

Continued exploration of the performance of the recently proposed cross-validation-based approach for delimiting home ranges using the Time Local Convex Hull (T-LoCoH) method has revealed a number of issues with the original formulation.

Main text

Here we replace the ad hoc cross-validation score with a new formulation based on the total log probability of out-of-sample predictions. To obtain these probabilities, we interpret the normalized LoCoH hulls as a probability density. The application of the approach described here results in optimal parameter sets that differ dramatically from those selected using the original formulation. The derived metrics of home range size, mean revisitation rate, and mean duration of visit are also altered using the corrected formulation.

Conclusion

Despite these differences, we encourage the use of the cross-validation-based approach, as it provides a unifying framework governed by the statistical properties of the home ranges rather than subjective selections by the user.

Background

Continued exploration of the cross-validation-based approach proposed in [1] has revealed a number of issues with the original formulation of the optimization equation. That formulation was ad hoc in its combination of two statistical approaches (cross-validation and information criteria), and the result was a metric without a clear basis in statistical theory. As such, we strongly recommend that users rely upon the method described here rather than the one set forth in the original publication. In particular, the shortcomings can be summarized as follows:

  1.

    Both cross-validation and information criterion approaches aim to avoid over-fitting. In the case of cross-validation, one attempts to estimate out-of-sample prediction error, so the score used should be a measure of prediction errors of the held-out points. If the model uses k too small or s too large, it is likely to overfit the training data and will predict the testing data poorly. On the other hand, if the model uses k too large or s too small, it will underfit the training data by missing the real variations in space use. Thus, cross-validation naturally penalizes model complexity because excessive complexity (small k) results in poor predictions. Information criteria approaches include a penalty term that increases with model complexity as measured by larger numbers of parameters. Using such an information criterion as a cross-validation score is not necessary since cross-validation should naturally penalize excessive model complexity.

  2.

    The formulation of the information criterion score did not follow the rules of probability because probabilities of out-of-sample predictions were not properly normalized, and multiple probabilities were combined by summation. In this sense, it lacked a firm connection to the statistical theory underlying information criteria approaches.

Here we propose an alternative formulation in which we interpret a normalized version of LoCoH hulls as an estimated probability surface and recast the cross-validation score as the total log probability of out-of-sample predictions, a common choice in cross-validation schemes. The approach, explained in detail below, results in more appropriate behavior, but also has the effect of significantly altering the optimal parameter values selected by the algorithm. Thus, in addition to presenting the new cross-validation equation, we include tables and figures with the newly selected parameter values and newly calculated derived metric values (home range area, mean duration, and mean visitation rates). Finally, we offer an alternative R script that searches a much broader parameter space in a more efficient manner (Additional file 1).

Updated Cross-Validation Approach

Using the training/testing splits described in the original presentation of the algorithm, we conducted a grid-based exploration of parameter space (Fig. 1), whereby each of the training/testing datasets (i = {1,...,n}) was analyzed at every combination of k and s values on the grid. This analysis entailed the creation of local convex hulls with k nearest neighbors and a scaling factor of s. In all subsequent analyses, we assume that the scaling of time in the time-scaled distance (TSD) follows a linear formulation; however, when movement patterns more closely exemplify diffusion dynamics, an alternative equation for the TSD may be more appropriate [2]. The test points (j = {1,...,m}) were then overlaid on the resulting hulls.

Fig. 1

Conceptual Figure of Grid-based Search. A cross-validation surface is generated as the algorithm searches over a grid of alternative s and k values for each individual movement path. The increments of the grid can be chosen by the user. The peak in the surface indicates that the home range associated with that parameter set offers the highest probability for the test points. Here, the white boxes denote the maximum probability value and, thereby, the optimal parameter set
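To make the structure of this search concrete, the following minimal R sketch loops over a grid of k and s values and a set of training/testing splits. It is illustrative only and does not call T-LoCoH functions: the splitting rule (a random 10% hold-out per split) and the scoring helper score_fun are assumptions standing in for the hull-based log probability defined below.

```r
# Sketch of the grid-based exploration (illustrative only; not T-LoCoH code).
# pts:       data.frame with columns x, y, and t (timestamp) for one individual
# score_fun: function(train, test, k, s) returning the summed log probability of
#            the test points under the hulls built from the training points
#            (see the score defined below)
grid_search <- function(pts, score_fun,
                        k_vals = seq(4, 800, by = 20),
                        s_vals = seq(0, 0.05, by = 0.01),
                        n_splits = 10) {
  scores <- matrix(NA_real_, nrow = length(k_vals), ncol = length(s_vals),
                   dimnames = list(as.character(k_vals), as.character(s_vals)))
  for (ki in seq_along(k_vals)) {
    for (si in seq_along(s_vals)) {
      total <- 0
      for (i in seq_len(n_splits)) {
        # assumed splitting rule: hold out a random 10% of fixes as test points;
        # the original algorithm may partition the track differently
        test_idx <- sample(nrow(pts), size = ceiling(0.1 * nrow(pts)))
        total <- total + score_fun(pts[-test_idx, ], pts[test_idx, ],
                                   k_vals[ki], s_vals[si])
      }
      scores[ki, si] <- total
    }
  }
  best <- which(scores == max(scores), arr.ind = TRUE)[1, ]
  list(scores = scores, k_opt = k_vals[best["row"]], s_opt = s_vals[best["col"]])
}
```

The k_opt and s_opt returned here correspond to the white boxes marking the surface peak in Fig. 1.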

We formulate the probabilities for out-of-sample points by normalizing the LoCoH surface so that the probability of an observation occurring at a particular location can be calculated. This value is obtained by dividing the number of training hulls that contain the test point location (\(g_{i,j}\)) by the summed area of all training hulls (\(A_{i}\)). The log probability is then calculated for each test point per training hullset. To avoid log probability values of − ∞, test points that are not contained within any hulls are assigned a probability value equal to the inverse of \(A_{i}^{2}\), resulting in a substantially lower log probability than that of a test point contained in a single hull. Finally, a single value (\(P_{k,s}\)) is assigned to each combination of k and s values by summing across all of the test points in all of the training/testing datasets:

$${P_{k,s}} = \sum_{i=1}^{n} \sum_{j=1}^{m} \log\left(\frac{g_{i,j}}{A_{i}}\right) $$

Because the probability of each test point is normalized by the total area contained within all of the training hulls, there exists a natural penalty for high k values. For example, a k value equal to the number of training points (\(k_{max}\)), regardless of the s value, will result in all hulls being identical and each test point overlapping all of the hulls. However, the large total area of the hullset when \(k = k_{max}\) will result in relatively small probability values for each test point (i.e., probability values equal to the inverse of the area of one of the hulls), effectively penalizing the parameter set containing \(k_{max}\). The underlying cross-validation procedure could easily be extended to optimize the adaptive parameter in the a-method (as opposed to the k-method) because the score is normalized by the total area of the hullset.
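For concreteness, the score can be assembled directly from the quantities appearing in the equation above. The short R sketch below is a minimal illustration, not code from the T-LoCoH package; it assumes the hulls have already been constructed for a given (k, s), so that the containment counts g[i, j] and total hull areas A[i] are available as inputs.

```r
# Minimal sketch (not from the T-LoCoH package): total log probability of the
# held-out points for one (k, s) combination, given precomputed inputs.
#   g: n x m matrix; g[i, j] = number of training hulls (split i) containing test point j
#   A: length-n vector; A[i] = summed area of all training hulls for split i
log_prob_score <- function(g, A) {
  stopifnot(nrow(g) == length(A))
  total <- 0
  for (i in seq_along(A)) {
    p <- g[i, ] / A[i]               # normalized hull "density" at each test point
    p[g[i, ] == 0] <- 1 / A[i]^2     # floor for points outside every hull (avoids log(0))
    total <- total + sum(log(p))
  }
  total                              # P_{k,s} for the current parameter combination
}

# Toy usage: two training/testing splits with three test points each
g <- rbind(c(4, 0, 2), c(1, 3, 0))
A <- c(120.5, 98.2)                  # summed hull areas, e.g. in km^2
log_prob_score(g, A)
```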

Results

The optimal parameter values selected using the corrected cross-validation method are substantially different from those selected in the original publication (Table 1). However, because the original formulation was not supported by cohesive statistical theory, we will discuss these new results only in reference to the guideline-based parameter values rather than comparing them to the results emerging from the published algorithm. The mean s value selected using the algorithm for springbok was 0.02 (SE = 0.008) and for zebra was 0.0012 (SE = 0.0005). The mean s value selected using the guidelines for springbok was 0.005 (SE = 0.002) and 0.017 (SE = 0.002) for zebra. Thus, the s values selected by the algorithm and the guidelines were not significantly different for springbok (p=0.10), but were for zebra (p<0.001). In the case of the k values, the optimal values selected using the algorithm were significantly higher than those resulting from the guidelines. The mean k value selected using the algorithm for springbok was 225.5 (SE = 66.83) whereas the mean using the guidelines was 22.5 (SE = 1.71; p=0.003). The same trend was observed in zebra where the mean k value based on the algorithm was 347.2 (SE = 54.36), whereas the mean from the guidelines was 20 (SE = 1.58; p=0.004).

Table 1 Parameter values for analysis

The significantly higher k values emerging from the algorithm gave rise to significantly larger home ranges in both species (Table 2). In springbok, the mean home range size was 265.41 km2 (SE = 76.23 km2) using the high end of the guideline based range, and 401.64 km2 (SE = 127.56 km2) using the algorithm (p=0.05). In zebra, the mean home range was 694.43 km2 (SE = 80.81 km2) using the guidelines and 1081.29 km2 (SE = 162.17 km2) when the algorithm was applied (p=0.01). When the derived metrics were considered, however, the substantial differences in k values did not always result in significantly different duration (Table 3) and visitation rates (Table 4). Though the duration rates in zebra derived from the algorithm were, indeed, significantly higher than those derived using the high value from the range based on the guidelines (p=0.05), this was not the case for springbok (p=0.08). Similarly, the visitation rates emerging from the parameter sets selected by the algorithm were not significantly different from those derived based on the guidelines in either species (p=0.33 in springbok and p=0.15 in zebra).

Table 2 Home range areas (in square kilometers)
Table 3 Mean duration (MNLV) values. The derived metrics obtained using the parameter sets recommended by the algorithm and by the guidelines set forth in the T-LoCoH documentation
Table 4 Mean visitation (NSV) values

Conclusion

The results presented here indicate that the effect of selecting parameters using the algorithm rather than the guidelines will be highly contingent upon the focus of the research question. Where home range delineation is the goal, the results are likely to differ significantly (Fig. 2). In the case of epidemiological questions, however, the effects will be somewhat less predictable, and in certain cases, similar conclusions might be drawn irrespective of the approach used for selecting optimal parameters. If an element of the analysis involves comparisons across individuals or species, however, the cross-validation-based approach provides a unifying framework governed by statistical properties of the home ranges rather than subjective selections by the user.

Fig. 2

Comparison of Resulting Home Ranges. An illustration of two sets of home ranges that result from the parameter sets chosen by the algorithm (red), the low range of the guide (blue), and the high range of the guide (black). The home range set on the left is based on the sample points from the springbok AG207, and the largest home range covers 429.81 km2. The home range set on the right is based on the GPS fixes from zebra AG256, and the largest home range covers 1363.21 km2

Fig. 3

High Resolution Cross-Validation Surface. A high resolution depiction of a portion of the optimal parameter space traversed during the final stage of the efficient search algorithm. All parameter sets with log probability values above -10090 are shown, with darker shading indicating higher probability. In this particular application, the search is performed over smaller intervals of s (0.0001 rather than 0.001), and the optimal parameter set (k=171 and s=0.0133) is similar to the parameter set selected at the coarser scale

References

  1. Dougherty ER, Carlson CJ, Blackburn JK, Getz WM. A cross-validation-based approach for delimiting reliable home range estimates. Mov Ecol. 2017; 5(1):19.


  2. Lyons AJ, Turner WC, Getz WM. Home range plus: a space-time characterization of movement over real landscapes. Mov Ecol. 2013; 1(1):2.



Acknowledgements

The authors would like to acknowledge Andy Lyons for creating, maintaining, and improving the T-LoCoH package.

Funding

The case study presented here used GPS movement data from zebra and springbok from Etosha National Park, Namibia, which were collected under a grant obtained by WMG (NIH GM083863). In addition, partial funding for this study was provided by NIH 1R01GM117617-01 to JKB and WMG. The funders had no role in study design, data collection and analysis, or manuscript writing.

Availability of data and materials

Please contact Wayne M. Getz (wgetz@berkeley.edu) for data requests.

Author information


Contributions

PDV and ERD developed the cross-validation approach. ERD ran analyses on empirical movement paths. All authors contributed to writing and editing the manuscript.

Corresponding author

Correspondence to Eric R. Dougherty.

Ethics declarations

Ethics approval and consent to participate

All movement data were collected according to the animal handling protocol AUP R217-0509B (University of California, Berkeley).

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1

A new R script for a more efficient grid-based search (Fig. 3) can be found at: https://github.com/doughertyeric/Updated_T-LoCoH_Algorithm. As currently parameterized, the grid-based search algorithm covers s values from 0 to 0.05 and k values between 4 and 800. The algorithm searches across the broadest set of k values in intervals of 20 and s values in intervals of 0.01. Upon identifying a peak in the probability surface, the algorithm selects a range of 40 k values around the peak and refines the search there in k value increments of 5. Finally, another range of 10 possible k values is selected and the finest scale grid-search is conducted in intervals of 1 and s value intervals of 0.001 before selecting the optimal parameter set. (R 11 kb)
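For orientation, the staged refinement can be summarized roughly as in the R sketch below. The function eval_score(k, s), which would return the hull-based log probability for a single parameter pair, and the exact window widths around each peak are assumptions made for illustration; the linked script remains the authoritative implementation.

```r
# Sketch of the coarse-to-fine search (illustrative; eval_score(k, s) is a
# stand-in for the hull-based log probability of one parameter pair, and the
# window widths around each peak are assumptions).
best_of <- function(k_vals, s_vals, eval_score) {
  grid <- expand.grid(k = k_vals, s = s_vals)
  grid$score <- mapply(eval_score, grid$k, grid$s)
  grid[which.max(grid$score), c("k", "s")]
}

refine_search <- function(eval_score) {
  # Stage 1: coarse pass over the full grid (k in steps of 20, s in steps of 0.01)
  peak <- best_of(seq(4, 800, by = 20), seq(0, 0.05, by = 0.01), eval_score)
  # Stage 2: k values around the coarse peak, searched in steps of 5
  peak <- best_of(seq(max(4, peak$k - 20), peak$k + 20, by = 5),
                  seq(0, 0.05, by = 0.01), eval_score)
  # Stage 3: k values around the refined peak in steps of 1, s in steps of 0.001
  best_of(seq(max(4, peak$k - 5), peak$k + 5, by = 1),
          seq(max(0, peak$s - 0.005), peak$s + 0.005, by = 0.001), eval_score)
}
```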

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Dougherty, E.R., de Valpine, P., Carlson, C.J. et al. Commentary to: a cross-validation-based approach for delimiting reliable home range estimates. Mov Ecol 6, 10 (2018). https://doi.org/10.1186/s40462-018-0128-2
