The authors have declared that no competing interests exist.

Conceived and designed the experiments: KC JFG DLS. Performed the experiments: KC AG AM. Analyzed the data: KC GEGP. Contributed reagents/materials/analysis tools: LM JL SE PLG GT AF. Wrote the paper: KC REG JFG DLS GT.

An understanding of the factors driving the distribution of pathogens is useful in preventing disease. Often we achieve this understanding at a local microhabitat scale; however, larger-scale processes are often neglected. This can result in misleading inferences about the distribution of the pathogen, inhibiting our ability to manage the disease. One such disease is Buruli ulcer, an emerging neglected tropical disease afflicting many thousands in Africa, caused by the environmental pathogen

Following extensive sampling of the community of aquatic macroinvertebrates in Cameroon, we select the 5 dominant insect Orders, and conduct an ecological niche model to describe how the distribution of

We find that the distribution of the bacterium in Cameroon is accurately described by the land cover and topography of the watershed, that there are notable seasonal differences in distribution, and that the Cameroon model does not predict the distribution of

Future studies of

Many pathogens persist in the environment, and an understanding of where they are can assist in disease control, allowing us to identify areas of risk to local human populations. Herein, we use generalized linear models to describe the distribution of a particular environmental pathogen,

Knowledge of the spatial distribution of an environmentally persistent pathogen is often key in creation of environmental hazard maps for disease control. Yet, despite the importance of this spatial information, only 4% of such pathogens have been mapped

Knowledge of the distribution of suitable habitats would allow us to predict the expected distribution of the pathogen. This approach has been successfully applied to the vectors of diseases such as malaria, plague and dengue

We hypothesised that these microhabitat variables could be indirectly inferred from large scale macroecological patterns. The distribution of swamp and forested environment, the shape and structure of the landscape, should predict the distribution of these microhabitats. For example, while the suitable habitat of a bacterium may be driven by the suitable combination of

We undertook ecological niche modelling of

The pathogen of our study,

Identification of the landscape variants that indicate suitable habitat for this particular pathogen has proven remarkably difficult, despite decades of research (see

However, the implication that the microbe is a specialist has been (apparently) contradicted by recent detection of the bacterium in the environment.

The many different species that

To address our questions we describe landscape variables correlated to the presence of the bacterium in aquatic macroinvertebrates in Cameroon, Central Africa. We then test our model against data collected in French Guiana to explore the generalizability of our findings. This will contribute to an understanding of the spatial distribution of this environmental pathogen, and further our ability to control Buruli ulcer disease.

A model was constructed on the dataset from Akonolinga, Cameroon, and predicted into French Guiana, South America. This enabled us to describe the niche of

The Cameroon dataset is a subset of that published in ^{2} were done to sample the aquatic community. Aquatic organisms were classified down to the Family level whenever possible and stored separately in 70% ethanol. Individuals belonging to the same taxonomic group were pooled together for detection of

Within Cameroon, Akonolinga is almost entirely rainforest. This region is dominated by the Nyong river and has fewer highland areas. Red dots are sample sites in Akonolinga.

| Site Code | Latitude | Longitude | Type of water body | Wet season: relative abundance of the 5 Orders (%) | Wet season: PCR positive samples of the 5 Orders, % (positive/samples) | Dry season: relative abundance of the 5 Orders (%) | Dry season: PCR positive samples of the 5 Orders, % (positive/samples) |
|---|---|---|---|---|---|---|---|
| A1 | N3 46.806 | E12 16.133 | Swamp | 93.69 | 6.98% (3/43) | 97.39 | 20% (4/20) |
| A2 | N3 47.083 | E12 15.383 | Swamp | 94.69 | 12.36% (11/89) | 94.53 | 8.11% (3/37) |
| A3 | N3 46.316 | E12 14.440 | Stream | 90.82 | 2.56% (1/39) | 90.41 | 0% (0/20) |
| A4 | N3 58.464 | E12 14.796 | River | 55.12 | 5.13% (2/39) | 32.17 | 5.26% (1/19) |
| A5 | N4 02.255 | E12 15.620 | Stream | 83.64 | 10% (4/40) | 75.30 | 5.26% (1/19) |
| A6 | N3 43.483 | E12 16.466 | Swamp | 90.31 | 17.24% (15/87) | 85.85 | 2.94% (1/34) |
| A7 | N3 38.889 | E12 15.986 | River | 79.52 | 10.26% (4/39) | 79.13 | 0% (0/13) |
| A8 | N3 38.980 | E12 14.696 | River | 64.86 | 9.09% (4/44) | 73.89 | 7.69% (1/13) |
| A9 | N3 29.912 | E12 06.425 | Swamp | 82.40 | 4.4% (4/91) | 91.67 | 2.7% (1/37) |
| A10 | N3 29.912 | E12 06.425 | Swamp | 93.62 | 7.5% (3/40) | 89.90 | 0% (0/20) |
| A11 | N3 23.271 | E12 07.870 | River | 50.12 | 4.65% (4/86) | 59.56 | 8.82% (3/34) |
| A12 | N3 28.788 | E12 07.255 | River | 52.87 | 2.44% (1/41) | 66.19 | 15% (3/20) |
| A13 | N3 32.322 | E11 57.643 | Swamp | 81.92 | 2.22% (2/90) | 79.39 | 8.11% (3/37) |
| A14 | N3 38.032 | E11 59.695 | River | 64.77 | 15.38% (6/39) | 63.37 | 20% (4/20) |
| A15 | N3 32.288 | E11 55.239 | Flooded | 94.33 | 11.11% (4/36) | 97.54 | 0% (0/15) |
| A16 | N3 32.276 | E11 55.181 | Stream | 89.85 | 2.86% (1/35) | 93.40 | 0% (0/15) |

A data set following the same methodology was independently collected in French Guiana, South America

| Site Code | Latitude | Longitude | Wet season: relative abundance of the 5 Orders (%) | Wet season: PCR positive samples of the 5 Orders, % (positive/samples) |
|---|---|---|---|---|
| FG10 | N4 44.170 | W-52 19.618 | 58.62 | 44.12% (15/34) |
| FG11 | N4 50.284 | W-52 21.195 | 52.54 | 32.26% (10/31) |
| FG19 | N5 17.773 | W-53 03.085 | 62.50 | 10.00% (1/10) |
| FG2 | N5 37.888 | W-53 42.433 | 70.83 | 11.76% (6/51) |
| FG23 | N5 21.724 | W-53 2.0200 | 29.27 | 16.67% (2/12) |
| FG28 | N5 36.328 | W-53 49.660 | 56.60 | 20.00% (6/30) |
| FG34 | N4 50.068 | W-52 18.126 | 85.00 | 32.35% (11/34) |
| FG38 | N5 23.646 | W-52 59.521 | 73.74 | 13.70% (10/73) |
| FG41 | N5 25.725 | W-53 05.326 | 41.07 | 8.70% (2/23) |
| FG43 | N5 22.632 | W-52 57.232 | 75.47 | 2.50% (1/40) |
| FG44 | N4 20.052 | W-52 09.148 | 25.42 | 0.00% (0/15) |
| FG45 | N4 18.025 | W-52 07.397 | 61.95 | 2.86% (2/70) |
| FG46 | N5 02.121 | W-52 30.989 | 74.24 | 2.04% (1/49) |
| FG47 | N4 55.744 | W-52 24.229 | 65.00 | 0.00% (0/26) |
| FG48 | N4 51.616 | W-52 16.518 | 16.67 | 0.00% (0/1) |
| FG49 | N5 39.996 | W-53 46.794 | 36.54 | 0.00% (0/19) |
| FG53 | N5 36.136 | W-53 50.182 | 67.86 | 0.00% (0/57) |
| FG7 | N4 51.648 | W-52 15.405 | 29.41 | 0.00% (0/10) |

Y_{wet} and Y_{dry}, which we use to describe the proportion of

Land cover in Akonolinga was described using several multispectral satellite images: SPOT 2.5 meter resolution images (references: 50833380811220923092V0 and 50833371012210937422V0), and a Landsat image (reference L72186056_05620021107). The study area was categorised into the following classes: Agriculture, Forest, Flood plain, Road, Savannah, Swamp and Urban (^{st} order being small streams, larger orders being big rivers). Proportion of 1^{st} to 8^{th} order streams, defined by the Strahler method

The topography and land cover of the sample sites were described within two different buffers (

This is in the north of Akonolinga, near the village of Emvong. The upper panel is a 5 km buffer around the sites; within this region we describe the topography and land cover, and its association with

The second buffer was defined using the watershed of the sample site (

The 42 variables estimated to describe the landscape were reduced to permit modelling. Principal component analysis (PCA) was performed on the landscape variables centred at the mean (x − x_{mean}) to summarize the data in the watershed and the 5 km buffer. PCAs were performed with the PCA function in the FactoMineR library in R. We ran a PCA of the 42 environmental variables in the watershed, PCA_{ws}, and a PCA of the 42 environmental variables in the 5 km buffer, PCA_{5 km}. In each PCA we examined the orthogonal axes that explained 95% of the variance in the 42 topography and land cover variables.
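The 95% cut-off described above can be illustrated with a short sketch. It is not the FactoMineR workflow used in the study; it only shows, in plain Python, how the number of retained components follows from a set of eigenvalues. The function name and the example eigenvalues are hypothetical.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, threshold=0.95):
    """Return how many leading components are needed to reach `threshold`
    of the total variance, given the PCA eigenvalues in any order."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for count, running in enumerate(accumulate(ordered), start=1):
        if running / total >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues for illustration only (not values from the study).
example_eigenvalues = [5.0, 3.0, 1.0, 0.6, 0.4]
n_components = components_for_variance(example_eigenvalues)
```

With these made-up eigenvalues the first four components carry 96% of the variance, so four would be retained.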

Firstly, 9 principal components explained 95% of the variance in the watershed of the sample site (PCA_{ws}). The magnitude and direction of each correlation is given in the supplementary materials. Translating these to ecologically meaningful terms, we describe PCA_{ws}1 as “large watersheds that drain flood plains”, given its strongly positive correlations to watershed surface area and floodplains; PCA_{ws}2 as “large watersheds that drain highland agriculture”; PCA_{ws}3 as “large watersheds that drain lowland agriculture”; PCA_{ws}4 as “small watersheds that drain swamp and forest at flat intermediate elevations”; PCA_{ws}5 as “small watersheds that drain highland urban and savannah”; PCA_{ws}6 as “small watersheds that drain highland urban and forest”; PCA_{ws}7 as “large watersheds that drain lowland forest, savannah and swamp”; PCA_{ws}8 as “small watersheds that drain urban and agricultural environments in hilly lowlands”; and PCA_{ws}9 as “small watersheds that drain wet swamps in areas that reach from low to high elevations” (

Secondly, for the local 5 km circular buffer, 6 principal components (PCA_{5 km}) explained 95% of the variance in the data as described in SM2. Translating these to ecologically meaningful terms, we describe PCA_{5 km}1 as representing “sites surrounded by flat lowland areas with urban, agriculture and the flood plains of large rivers”; PCA_{5 km}2 as representing “sites surrounded by sloped highland areas with urban, agriculture and small rivers”; PCA_{5 km}3 as representing “sites surrounded by sloped highland areas with savannah and large swampy rivers”; PCA_{5 km}4 as representing “sites surrounded by flat lowland areas with savannah and small rivers”; PCA_{5 km}5 as representing “sites surrounded by flat highlands with urban, agriculture and large rivers”, and PCA_{5 km}6 as representing “sites surrounded by lowland hills, with small rivers and many small basins, in unforested environment”, (

We allow model selection to choose which of these principal components are most informative in the species distribution, Y_{wet} and Y_{dry}. The dry season generalized linear models (GLMs) and wet season GLMs were fitted separately with glmulti in the glmulti library in R. Glmulti finds the best set of GLMs among all possible combinations of explanatory variables; for example, all possible Y_{dry}∼PCA_{5 km} models were fitted, and each was evaluated with the Akaike information criterion corrected for small sample sizes (AICc). Low AICc scores indicate good performance and reduced overfitting
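To make the selection criterion concrete, the sketch below computes AICc from a model's log-likelihood and enumerates every subset of a set of candidate predictors, which is the kind of exhaustive screen described above. It is a Python illustration rather than the glmulti call used in the study; the function names and predictor labels are assumptions, and fitting the GLMs themselves is left out.

```python
from itertools import combinations


def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC.

    k is the number of estimated parameters (including the intercept)
    and n is the number of observations (here, sample sites).
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n - k - 1 <= 0")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def candidate_models(predictors):
    """List every subset of predictors, from intercept-only to the full model."""
    models = []
    for size in range(len(predictors) + 1):
        models.extend(combinations(predictors, size))
    return models


# Hypothetical predictor labels; three predictors give 2**3 = 8 candidate models.
models = candidate_models(["PCA5km1", "PCA5km2", "PCA5km3"])
```

Because the correction term grows quickly as k approaches n, AICc penalises complex models heavily when only 16 sites are available.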

The response variable changed seasonally, resulting in two response variables, Y_{dry} and Y_{wet}. Along with the PCA_{5 km} and PCA_{ws} inputs this resulted in four models: Y_{dry}∼PCA_{5 km} and Y_{dry}∼PCA_{ws} in the dry season, and Y_{wet}∼PCA_{5 km} and Y_{wet}∼PCA_{ws} in the wet season. This reduces our variables by retaining those that are important. Then, to compare the importance of PCA_{5 km} (local) and PCA_{ws} (regional watershed) in the distribution of the response variable, we fitted Y_{dry}∼PCA_{5 km}+PCA_{ws} in the dry season, and Y_{wet}∼PCA_{5 km}+PCA_{ws} in the wet season. In this way, by allowing glmulti to retain or drop these variables we can compare the importance of the watershed and local 5 km area variables in the distribution of

Potential effects of multicollinearity were explored but were deemed minimal, as all pairwise Pearson correlation coefficient R values in the principal components were below 0.75 (
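A pairwise screen of this kind can be sketched as follows. The 0.75 limit is the one stated above; the function names and the example component scores are hypothetical, and the check is written in plain Python rather than R.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlated_pairs(columns, limit=0.75):
    """Return (name_a, name_b, r) for every pair with |r| at or above `limit`."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= limit:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical component scores at four sites (illustration only).
scores = {
    "pc_a": [1.0, 2.0, 3.0, 4.0],
    "pc_b": [2.0, 4.0, 6.0, 8.1],
    "pc_c": [1.0, -1.0, -1.0, 1.0],
}
flagged = correlated_pairs(scores)
```

In this toy example only the first two columns are flagged; an empty list would correspond to the situation reported in the text.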

In the initial screen of variables, Y_{dry}∼PCA_{5 km} and Y_{dry}∼PCA_{ws} retained PCA_{ws}4, “small watersheds that drain swamp and forest at flat intermediate elevations”, PCA_{ws}9, “small watersheds that drain wet swamps in areas that reach from low to high elevations” and PCA_{5 km}2, “sites surrounded by sloped highland areas with urban, agriculture and small rivers”. These were included in the model of interest, Y_{dry}∼PCA_{5 km}+PCA_{ws}.

For the wet season Y_{wet}∼PCA_{5 km} and Y_{wet}∼PCA_{ws} retained PCA_{ws}1, “large watersheds that drain flood plains”, PCA_{ws} 5, “small watersheds that drain highland urban and savannah”, PCA_{ws} 6, “small watersheds that drain highland urban and forest”, PCA_{ws} 8, “small watersheds that drain urban and agricultural environments in hilly lowlands”, PCA_{5 km}2, “sites surrounded by sloped highland areas with urban, agriculture and small rivers” and PCA_{5 km}4, “sites surrounded by flat lowland areas with savannah and small rivers”, which were included in Y_{wet}∼PCA_{5 km}+PCA_{ws}.

We interpolate the Akonolinga model within the region of Akonolinga to predict the distribution of suitable habitat, the reservoir, of _{5 km} and PCA_{ws} format, and the GLM was predicted. As a summary to describe this distribution, we use Moran's Index of spatial autocorrelation, which quantifies the extent to which values are spatially clustered, dispersed or randomly arranged, and is here used to describe the distribution of suitable sites. This is implemented using the tool Spatial Autocorrelation Global Moran's I in ArcMap 10.1
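For readers unfamiliar with the statistic, a minimal sketch of global Moran's I is given below. It uses the standard definition, I = (n / W) · Σ_ij w_ij (x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)², with a binary neighbour matrix. The weighting scheme and the example values are assumptions for illustration and are not the ArcMap settings used in the study.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` given a square spatial weights matrix.

    weights[i][j] is the spatial weight between sites i and j
    (zero on the diagonal).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    numerator = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    denominator = sum(d * d for d in dev)
    return (n / w_total) * (numerator / denominator)


# Four hypothetical sites along a line; each site neighbours the adjacent ones.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1.0, 2.0, 3.0, 4.0], chain)      # smooth gradient
alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # checkerboard pattern
```

Positive values indicate that neighbouring sites have similar values, negative values indicate that neighbours differ, and values near the expectation of −1/(n − 1) are consistent with a random arrangement.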

We extrapolate the Akonolinga wet season model to French Guiana, to understand how the suitable habitat in one region is similar to that in another. For comparability, the wet season model, constructed in Cameroon, was used to predict the positive sites among the 18 sampled sites in French Guiana. Values of PCA_{5 km} and PCA_{ws} in French Guiana were generated using the ind.sup option in the PCA function. The Akonolinga wet season model was then predicted into French Guiana using the land cover data provided by the French

As discussed above, the choice of error structure is important in the performance of a GLM. We aim to describe the distribution of the bacterium, so preference is given to the model with the lowest residual values in the model, which in this case is Gaussian rather than Binomial error structure. Residuals were much lower in a Gaussian model, as shown in

The wet and dry season watershed Gaussian models were predicted on the pour point data using the predict.glm function in R. The model predictions of habitat suitability at these pour points were then interpolated using Inverse Distance Weighting in the IDW tool of ArcMap 10
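The interpolation step can be sketched as below. This is a plain Python illustration of inverse distance weighting, not the ArcMap IDW tool; the power of 2, the planar coordinates and the example values are assumptions.

```python
from math import hypot


def idw(known_points, target, power=2.0):
    """Inverse-distance-weighted estimate at `target`.

    known_points is a list of (x, y, value) tuples, for example model
    predictions at pour points; target is an (x, y) pair.
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, value in known_points:
        distance = hypot(target[0] - x, target[1] - y)
        if distance == 0:
            # The target coincides with a known point: return its value directly.
            return value
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Two hypothetical pour points with predicted habitat suitability.
points = [(0.0, 0.0, 0.1), (2.0, 0.0, 0.3)]
midpoint_estimate = idw(points, (1.0, 0.0))
near_first = idw(points, (0.5, 0.0))
```

At the midpoint both points receive equal weight, so the estimate is their average; closer to the first point, its lower value dominates.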

The final fitted wet season Binomial logit GLM, after stepwise AICc selection, was_{ws}9, “small watersheds that drain wet swamps in areas that reach from low to high elevations”, and was negatively correlated to _{5 km}2 represents “sites surrounded by sloped highland areas with urban, agriculture and small rivers”. This was also negatively correlated to

The spatial distribution of

Units of habitat suitability are the proportion of qPCR pools predicted to be positive, based on the field work of

The final fitted dry season binomial logit GLM, after stepwise AICc selection, is_{ws}1, “large watersheds that drain flood plains”, which was marginally negatively correlated to _{5 km}2, “sites surrounded by areas with urban, agriculture and small rivers” was positively correlated to _{5 km}4, “sites surrounded by areas with savannah and small rivers”, was positively correlated to

The spatial distribution of

Spatial autocorrelation of model residuals can be an issue in GLMs, but this was explored, and it was not the case here. Model residuals were not significantly spatially autocorrelated in the wet season (Moran's Index: −0.285386, z-score: −1.045844,

The AICc of the final dry season Binomial model was 49.6, and the absolute sum of the residuals was 11.03. The AICc of the final wet season Binomial model was 67.8, and the absolute sum of the residuals was 11.95.

We note that the Gaussian models performed markedly better. The AICc of the final dry season Gaussian model was −39.8, and the absolute sum of the residuals was 0.53. The AICc of the final wet season Gaussian model was −65.5, and the absolute sum of the residuals was 0.24. Model performance is presented in

The Akonolinga wet season model was predicted into 18 sample sites in French Guiana (^{nd} row). The model predicted sites to be positive or negative, and the results of qPCR corroborated these predictions (

Sample sites were as in ^{rd} row, left hand side). The model under-predicted,

Here, we have demonstrated that in addition to local variables around the sample site, the distribution of

Many of the findings are in accord with what little we already understand about this bacterium. _{ws}9, here termed “small watersheds that drain wet swamps in areas that reach from low to high elevations” which negatively correlated to

Our study was limited in certain regards, as we focused it on the prevalence of

The Akonolinga wet season model was extrapolated into French Guiana, where sampling was in the wet season. Despite good performance in Akonolinga, the model performed poorly in French Guiana, under-predicting the bacterium's distribution (

Regardless of error structure, selection of both types of models (Gaussian and Binomial) retained watersheds as important variables. These findings will impact future research on Buruli ulcer and

This is consistent with the idea of a ‘flushing’ effect of rainfall in the wet season, carrying bacteria downstream

The distribution of environmental pathogens needs to be understood to facilitate control. Commonly, local effects in the microhabitats are considered to describe the ecological niche of a pathogen. However, our study demonstrates that regional effects are important factors to be considered. Future research on the

Glmulti output for Binomial and Gaussian models. The sum of absolute model residuals is plotted against AICc. Among the models within 2 AICc units of the best model (vertical lines), we select the model with the lowest residuals (highlighted in red).

(DOC)

Observed against predicted values for each model. Note that Gaussian models have a much better fit.

(DOC)

Quantile-quantile plots of normality. The residuals of the Gaussian and Binomial models are similarly close to normally distributed, though the Binomial model displays a larger variance of residuals.

(DOC)

Results of principal component analysis for topographical and land cover variables in a watershed buffer. 95% of the variance in the data was described with 9 components; the eigenvalue of each component is given at the bottom of the table. Each component correlates differently to different variables; red highlights negative correlations, blue highlights positive correlations. PCA_{ws}1 describes large watersheds that drain flood plains and swamps, with few urban and agricultural areas. These are high elevation areas with variable slopes. PCA_{ws}2 describes large watersheds that drain agriculture at flat highland areas. PCA_{ws}3 describes large rivers that drain urban and agriculture areas at flat lowlands, with little forest. PCA_{ws}4 describes small rivers, with small watersheds that drain forest and swamp areas, without urban areas. These are at intermediate elevations, with flat areas. PCA_{ws}5 describes small rivers that drain urban and savannah areas, predominantly in higher elevation flat lands. PCA_{ws}6 corresponds to small low order streams that drain urban and forest (not agriculture) in high elevation slopes. PCA_{ws}7 is larger watersheds that drain forest, savannah, flood plain and swamp, in areas with flat, wet lowlands. PCA_{ws}8 represents small watersheds that drain urban and agriculture, flood plain and savannah. These areas are wet lowlands with lots of small hills. PCA_{ws}9 represents small watersheds that drain wet swamps in areas that reach from low to high elevations.

(DOC)

Results of principal component analysis for topographical and land cover variables in a 5 km buffer around the sample site. 95% of the variance in the data was described with 6 components. Each component correlates differently to different variables; red highlights negative correlations, blue highlights positive correlations. Surface area is constant, at π5^{2} ≈ 79 km^{2}. PCA_{5 km}1 represents sites surrounded by flat lowland areas and urban, agriculture and the flood plains of large rivers. PCA_{5 km}2 represents sites surrounded by sloped highland areas and urban and agriculture, and small rivers. PCA_{5 km}3 represents sites surrounded by sloped highland areas with savannah, and large swampy rivers. PCA_{5 km}4 represents sites surrounded by flat lowland areas with savannah and small rivers. PCA_{5 km}5 represents sites surrounded by flat highlands with urban and agriculture, and large rivers. PCA_{5 km}6 represents sites surrounded by lowland hills, with small rivers and many small basins, in unforested environment.

(DOC)

Pearson product-moment correlation coefficients (R) in the wet season model. Stepwise selection selected 3 components, none of which were correlated.

(DOC)

Pearson product-moment correlation coefficients (R) in the dry season model. Stepwise selection selected 6 components, none of which were correlated.

(DOC)

Contingency table describing model performance of niche models constructed in Cameroon and predicted into French Guiana. The rows ‘Prediction’ are model predictions, ‘Test’ are the results from qPCR of the sites in French Guiana. Values in blue are true positives and true negatives; values in red are false positives and false negatives.

(DOC)

We are grateful to the staff of the Centre Pasteur and IRD for their invaluable help in different phases of the study, notably during data collection. We also thank Annelise Tran of CIRAD and Benjamin Roche of IRD for invaluable discussions and insights on previous versions of the manuscript, the ISIS Spot programme for support in acquiring SPOT images, and Hervé Chevillotte (IRD Cameroon), for environmental data from the IFORA project (ANR-Biodiv grant IFORA).