The package ‘spm’ is short for ‘spatial predictive modelling, spatial predictive model(s), or spatial predictive method(s)’. It aims to introduce some novel, accurate, hybrid geostatistical and machine learning methods for spatial predictive modelling. It contains functions for a few geostatistical and machine learning methods as well as their hybrid methods. Further methods will be included in the future according to their predictive accuracy and new developments in spatial predictive modelling field.

For each method, two functions are provided. Two further functions are provided to derive averaged variable importance and relative variable influence respectievly. They all use data.frame as input data. Moreover, two functions are provided for accuracy assessment, and a further function is for converting error measures (e.g., rmse) to accuracy measure (i.e., vecv). These functions attempt to simplify and streamline the model evaluation and model application processes, which may assist users to apply these methods to their data to improve modelling efficiency as well as predictive accuracy.

The methods currently included in the package are:

- two commonly used geostatistical methods: inverse distance weighting (idw) and ordinary kriging (ok);
- two highly accurate machine learning methods: random forest (rf) in library(randomForest) and general boosting model (gbm) in library(gbm);
- four hybrid methods: rfidw, rfok, gbmidw and gbmok;
- two averaging methods: the average of rfidw and rfok (rfokrfidw), and the average of gbmidw and gbmok (gbmokgbmidw); and
- For random forest method, all relevant methods above are also implemented using ranger (rg) in library(ranger), which lead to four methods: rg, rgidw, rgok and rgokrgidw. These methods are fast version of their corersponding rf methods.

A function, avi, is developed to derive an averaged variable importance (avi) based on random forest. And a function, rvi, to a derive relative variable influence (rvi) based on generalized boosted regression modeling.

Of the two functions for each method, one is for assessing the predictive errors and accuracy of the method based on cross-validation. These functions are:

- idwcv,
- okcv,
- rfcv,
- rgcv
- gbmcv,
- rfidwcv,
- rfokcv,
- rgidwcv,
- rgokcv,
- gbmidwcv,
- gbmokcv,
- rfokrfidwcv,
- rgokrgidwcv, and
- gbmokgbmidwcv.

The procedure for these functions can be summarised in Table 1.

Step | Description |
---|---|

1.1 | Resample the training data into k sub-datasets for k-fold validation |

1.2 | for Each training and corresponding validating sub-dataset, do |

1.3 | Develop a predictive model based on the training sub-dataset |

1.4 | Validate the model using the validating sub-dataset and record the resultant predictions |

1.5 | Repeat above procedure for each sub-dataset |

1.6 | end |

1.7 | Calculate the predictive accuracy based on the training data and the resultant predictions |

The accuracy produced from the above procedure may change with each run due to randomness associated with data resampling for cross validation. To stabilise the accuracy measure, the above steps need to be repeated n times; and the accuracy assessment function for each method provides an useful tool to realise this as demonstrated in section 4.3.

A unit, scale and data-variance independent accuracy measure, the vecv (i.e., variance explained by predictive models based on cross-validation) is implemented in function vecv for numerical data.

A few other error and accuracy measures are also available in function pred.acc for numerical data, including mean error (me), mean absolute error (mae), mean squared error (mse), relative me (rme), relative mae (rmae), root mse (rmse), relative rmse (rrmse), as well as the accuracy measure, vecv and Legates and McCabe’s E1 (e1).

pred.acc can also be sued to calculate correct classification rate (ccr), kappa (kappa), sensitivity (sens), specificity (spec) and true skill statistic (tss) for categorical data with the observed (obs) data specified as factor.

They are based on the differences between the predicted values for and the observed values of validation samples for cross-validation. For 0 and 1 data, the observed values need to be specified as factor in order to use accuracy measures for categorical data. Moreover, sens, spec, tss and rmse are for categorical data with two levels (e.g. presence and absence data).

A function, tovecv, is developed to convert existing predictive error measures to vecv. For the definition of vecv, please see function vecv in library (spm). The error measures considered are mean square error (mse), root mse (rmse), relative rmse (rrmse), standardised rmse (srmse) and mean square reduced error (msre).

The other function for each method is for generating the spatial predictions using the method, including:

- idwpred,
- okpred,
- rfpred,
- rgpred,
- gbmpred,
- rfidwpred,
- rfokpred,
- rgidwpred,
- rgokpred,
- gbmidwpred,
- gbmokpred,
- rfokrfidwpred,
- rgokrgidwpred and
- gbmokgbmidwpred.

With certain parameters specified for a method, the cross-validation function (e.g., rfokcv) produces information on the predictive accuracy of the method, and the prediction function (e.g., rfokpred) produces spatial predictions. By using the same random seed, the predictive accuracy values produced for these methods are directly comparable.

Install spm using:

The function idwcv is applied to seabed gravel content samples in the Petrel region to test the predictive accuracy of inverse distance squared, a commonly used spatial interpolation method.

The function rfokcv is applied to seabed gravel content samples in the Petrel region to test the predictive accuracy of RFOK.

```
library(spm)
data(petrel)
set.seed(1234)
rfokcv1 <- rfokcv(petrel[, c(1,2)], petrel[, c(1,2, 6:9)], petrel[, 5], predacc = "VEcv")
rfokcv1
[1] 39.88995
```

This accuracy is comparable with that of idw above as their random seeds were the same, i.e., the same datasets were generated and used in the cross validation.

The cross-validation functions above enable users to find optimal parameters and/or predictors for these methods (e.g., nmax and idp for idw). Taking idw as an example as below.

```
data(petrel)
idp <- c((1:10)*0.2)
nmax <- c(10:20)
idwopt <- array(0,dim=c(length(idp),length(nmax)))
for (i in 1:length(idp)) {
for (j in 1:length(nmax)) {
set.seed(1234)
idwcv2.3 <- idwcv(petrel[, c(1,2)], petrel[, 5], nmax = nmax[j], idp = idp[i], predacc = "VEcv" )
idwopt[i, j] <- idwcv2.3
}
}
which (idwopt == max(idwopt), arr.ind = T )
> row col
[1,] 3 3
idp[3]
> [1] 0.6
nmax[3]
> [1] 12
```

Predictive accuracy of optimised idw

```
library(spm)
data(petrel)
set.seed(1234)
idwcv1 <- idwcv(petrel[, c(1,2)], petrel[, 5], nmax = 12, idp = 0.6, predacc = "VEcv")
idwcv1
[1] 35.93557
```

As usual, such parameter and variable selection is time consuming if the range of relevant parameter and/or number of predictors are large, especially for gbm and gbm related methods. For the machine learning methods, it is recommended to use the existing variable selection methods to do preliminary selection and then use above procedure to do further selection if needed.

These selections may produce the most accurate predictive models, but the resultant predictions may not look real (i.e. containing artefacts) that may not satisfy the need of some clients. So the artefacts may need to be minimised by sacrificing a small amount of the accuracy to meet client’s expectation.

Again taking idwcv as an example, the accuracy produced above changes with each run of idwcv, i.e., with random seed. To stabilise the accuracy, we repeatedly run idwcv many times to produce an averaged predictive accuracy.

```
n <- 100 # number of iterations, 60 to 100 is recommended.
measures <- NULL
for (i in 1:n) {
idwcv1 <- idwcv(petrel [, c(1,2)], petrel [, 5], nmax = 12, idp = 0.6, predacc = "ALL")
measures <- rbind(measures, idwcv1$vecv)
}
mean(measures)
[1] 33.69691
```

This procedure can be applied to all other cross validation functions in spm to produced stablised predictive accuracy of relevant methods.

Seabed gravel content samples in the Petrel region, northern Australia marine margin are used to make spatial predictions using idw by running idwpred.

```
library(spm)
data(petrel)
data(petrel.grid)
idwpred1 <- idwpred(petrel[, c(1,2)], petrel[, 5], petrel.grid, nmax = 12, idp = 0.6)
names(idwpred1)
[1] "LON" "LAT" "var1.pred" "var1.var"
idwpred1 <- (idwpred1)[, -4] # remove the 4th column as it contains no information.
class(idwpred1)
[1] "data.frame"
names(idwpred1) <- c("longitude", "latitude", "gravel")
head(idwpred1)
longitude latitude gravel
470277 128.8022 -10.60239 22.00789
470278 128.8047 -10.60239 22.00805
470279 128.8072 -10.60239 22.00822
470280 128.8097 -10.60239 22.00838
470281 128.8122 -10.60239 22.00855
470282 128.8147 -10.60239 22.00873
```

Since grid file (e.g. petrel.grid) can be in WGS84 or UTM, the location info of the grid file were renamed to “LON” and “LAT”, and the variable to be predicted may vary with studies so was named to “var1”, in the prediction function. Users need to rename the prediction file as shown above.

Seabed gravel content samples in the Petrel sub-basin, northern Australia marine margin are used to make spatial predictions using rfok by running rfokpred.

```
set.seed(1234)
library(spm)
data(petrel)
data(petrel.grid)
data(petrel)
data(petrel.grid)
rfokpred1 <- rfokpred(petrel[, c(1,2)], petrel[, c(1,2, 6:9)], petrel[, 5],
petrel.grid[, c(1,2)], petrel.grid, ntree = 500, nmax = 11, vgm.args = ("Log"))
class(rfokpred1)
[1] "data.frame"
names(rfokpred1)
[1] "LON" "LAT" "Predictions" "Variances"
```

As shown above, the format of the output predictions by the functions for generating spatial predictions is in data.frame and can be exported to relevant format as desired, such as csv or raster, for further use. Please note that the variances are now added back to the output, but cautions need to be taken when use them because they are independent of the predictions.

This spm package and its associated documents are published with the permission of the CEO, Geoscience Australia.

Li, J., Potter, A., Huang, Z., Daniell, J. J. and Heap, A. 2010. Predicting Seabed Mud Content across the Australian Margin: Comparison of Statistical and Mathematical Techniques Using a Simulation Experiment. Geoscience Australia, Record 2010 /11, 146pp.

Li, J., Heap, A.D., Potter, A., Huang, Z., Daniell, J., 2011. Can we improve the spatial predictions of seabed sediments? A case study of spatial interpolation of mud content across the southwest Australian margin. Continental Shelf Research 31 1365-1376.

Li, J., Heap, A.D., Potter, A., Daniell, J., 2011. Application of machine learning methods to spatial interpolation of environmental variables. Environmental Modelling & Software 26 1647-1659.

Li, J., 2011. Novel spatial interpolation methods for environmental properties: using point samples of mud content as an example. The Survey Statistician: The Newsletter of the International Association of Survey Statisticians No. 63 15-16.

Li, J., Potter, A., Huang, Z. and Heap, A. 2012. Predicting Seabed Sand Content across the Australian Margin Using Machine Learning and Geostatistical Methods. Geoscience Australia, Record 2012/48, 115pp.

Li, J., 2016. Assessing spatial predictive models in the environmental sciences: accuracy measures, data variation and variance explained. Environmental Modelling & Software 80 1-8.

Li, J., 2017. Assessing the accuracy of predictive models for numerical data: Not r nor r2, why not? Then what? PLOS ONE 12 (8): e0183250.