Commentary to: a cross-validation-based approach for delimiting reliable home range estimates

Background Continued exploration of the performance of the recently proposed cross-validation-based approach for delimiting home ranges using the Time Local Convex Hull (T-LoCoH) method has revealed a number of issues with the original formulation. Main text Here we replace the ad hoc cross-validation score with a new formulation based on the total log probability of out-of-sample predictions. To obtain these probabilities, we interpret the normalized LoCoH hulls as a probability density. The application of the approach described here results in optimal parameter sets that differ dramatically from those selected using the original formulation. The derived metrics of home range size, mean revisitation rate, and mean duration of visit are also altered using the corrected formulation. Conclusion Despite these differences, we encourage the use of the cross-validation-based approach, as it provides a unifying framework governed by the statistical properties of the home ranges rather than subjective selections by the user. Electronic supplementary material The online version of this article (10.1186/s40462-018-0128-2) contains supplementary material, which is available to authorized users.


Background
Continued exploration of the the cross-validation-based approach proposed in [1] has revealed a number of issues with the original formulation of the optimization equation. This original formulation was ad hoc in its combination of two statistical approaches (cross-validation and information criteria), and the result was a metric without a clear basis in statistical theory. As such, we strongly recommend that users rely upon the method described here as opposed to one set forth in the original publication. In particular, the shortcomings can be summarized as follows: 1. Both cross-validation and information criterion approaches aim to avoid over-fitting. In the case of cross-validation, one attempts to estimate out-of-sample prediction error, so the score used should be a measure of prediction errors of the held-out points. If the model uses k too small or s too large, it is likely to overfit the training data and will predict the testing data poorly. On the other hand, if the model uses k too large or s too small, it will underfit the training data by missing the real variations in space use. Thus, cross-validation naturally penalizes model complexity because excessive complexity (small k ) results in poor predictions. Information criteria approaches include a penalty term that increases with model complexity as measured by larger numbers of parameters. Using such an information criterion as a cross-validation score is not necessary since cross-validation should naturally penalize excessive model complexity. 2. The formulation of the information criterion score did not follow the rules of probability because probabilities of out-of-sample predictions were not properly normalized, and multiple probabilities were combined by summation. In this sense, it lacked a firm connection to the statistical theory underlying information criteria approaches.
Here we propose an alternative formulation in which we interpret a normalized version of LoCoH hulls as an estimated probability surface and recast the cross-validation score as the total log probability of out-of-sample predictions, a common choice in cross-validation schemes. The approach, explained in detail below, results in more appropriate behavior, but also has the effect of significantly altering the optimal parameter values selected by the algorithm. Thus, in addition to presenting the new crossvalidation equation, we include tables and figures with the newly selected parameter values and newly calculated derived metric values (home range area, mean duration, and mean visitation rates). Finally, we offer an alternative R script that searches a much broader parameter space in a more efficient manner (Additional file 1).

Updated Cross-Validation Approach
Using the training/testing split as described in the original presentation of the algorithm, a grid-based exploration of parameter space was conducted ( Fig. 1), whereby each of the training/testing datasets (i = {1, ..., n}) was analyzed at every combination of k and s values on the grid. This analysis entailed the creation of local convex hulls with k nearest neighbors and a scaling factor of s. In all subsequent analyses, we assume that the scaling of time follows a linear formulation; however, when movement patterns more closely exemplify diffusion dynamics, an alternative equation for the TSD may be more appropriate [2]. The test points (j = {1, ..., m}) were then laid upon the resulting hulls.
We formulate the probabilities for out-of-sample points by normalizing the LoCoH surface so that the probability of an observation occurring at a particular location can be calculated. This value is obtained by dividing the number of training hulls that contain the test point location (g i,j ) by the summed area of all training hulls (A i ). Then, the log probability was calculated for each point per training hullset. To avoid log probability values of -∞, test points that were not contained within any hulls were assigned a probability value equal to the inverse of A 2 i , resulting in a substantially lower log probability than that of a test point contained in a single hull. Finally, a single value (P k,s ) was assigned to each combination of k and s value by summing across all of the test points in all of the training/testing datasets: Because the probability of each test point is normalized based on the total area contained within all of the training hulls, there exists a natural penalty for high k values. For example, a k value equal to the number of training points (k max ; regardless of the s value) will result in all hulls being identical and each test point overlapping all of the hulls.  However, the large total area of the hullset when k = k max will result in relatively small probability values for each test point (i.e., independent probability values equal to the inverse of the area of one of the hulls), effectively penalizing the parameter set containing k max . The underlying cross-validation procedure could very easily be extended for the optimization of the the adaptive parameter in the a-method (as opposed to the k-method) because of its scaling with the total area of the hullset.

Results
The optimal parameter values selected using the corrected cross-validation method are substantially different from those selected in the original publication (Table 1). However, because the original formulation was not supported The total area of the home range obtained using the parameter sets recommended by the algorithm and by the guidelines set forth in the T-LoCoH documentation by cohesive statistical theory, we will discuss these new results only in reference to the guideline-based parameter values rather than comparing them to the results emerging from the published algorithm. The mean s value selected using the algorithm for springbok was 0.02 (SE = 0.008) and for zebra was 0.0012 (SE = 0.0005). The mean s value selected using the guidelines for springbok was 0.005 (SE = 0.002) and 0.017 (SE = 0.002) for zebra. Thus, the s values selected by the algorithm and the guidelines were not significantly different for springbok (p = 0.10), but were for zebra (p < 0.001). In the case of the k values, the optimal values selected using the algorithm were significantly higher than those resulting from the guidelines. The mean k value selected using the algorithm for springbok was 225.5 (SE = 66.83) whereas the mean using the The derived metrics obtained using the parameter sets recommended by the algorithm and by the guidelines set forth in the T-LoCoH documentation The significantly higher k values emerging from the algorithm gave rise to significantly larger home ranges in both species (Table 2). In springbok, the mean home range size was 265.41 km 2 (SE = 76.23 km 2 ) using the high end of the guideline based range, and 401.64 km 2 (SE = 127.56 km 2 ) using the algorithm (p = 0.05). In zebra, the mean home range was 694.43 km 2 (SE = 80.81 km 2 ) using the guidelines and 1081.29 km 2 (SE = 162.17 km 2 ) when the algorithm was applied (p = 0.01). When the derived metrics were considered, however, the substantial differences in k values did not always result in significantly different duration (Table 3) and visitation rates (Table 4). Though the duration rates in zebra derived from the algorithm were, indeed, significantly higher than those derived using the high value from the range based on the guidelines (p = 0.05), this was not the case for springbok (p = 0.08). Similarly, the visitation rates emerging from the parameter sets selected by the algorithm were not significantly different from those derived based on the guidelines in either species (p = 0.33 in springbok and p = 0.15 in zebra).

Fig. 3
High Resolution Cross-Validation Surface. A high resolution depiction of a portion of the optimal parameter space traversed during the final stage of the efficient search algorithm. All parameter sets with log probability values above -10090 are shown, with darker shading indicating higher probability. In this particular application, the search is performed over smaller intervals of s (0.0001 rather than 0.001), and the optimal parameter set (k = 171 and s = 0.0133) is similar to the parameter set selected at the coarser scale

Conclusion
The results presented here indicate that the effect of selecting parameters using the algorithm rather than the guidelines will be highly contingent upon the focus of the research question. Where home range delineation is the goal, the results are likely to differ significantly (Fig. 2). In the case of epidemiological questions, however, the effects will be somewhat less predictable, and in certain cases, similar conclusions might be drawn irrespective of the approach used for selecting optimal parameters. If an element of the analysis involves comparisons across individuals or species, however, the cross-validation-based approach provides a unifying framework governed by statistical properties of the home ranges rather than subjective selections by the user.

Additional file
Additional file 1: A new R script for a more efficient grid-based search (Fig. 3) can be found at: https://github.com/doughertyeric/Updated_T-LoCoH_Algorithm. As currently parameterized, the grid-based search algorithm covers s values from 0 to 0.05 and k values between 4 and 800. The algorithm searches across the broadest set of k values in intervals of 20 and s values in intervals of 0.01. Upon identifying a peak in the probability surface, the algorithm selects a range of 40 k values around the peak and refines the search there in k value increments of 5. Finally, another range of 10 possible k values is selected and the finest scale grid-search is conducted in intervals of 1 and s value intervals of 0.001 before selecting the optimal parameter set. (R 11 kb)