METHOD OVERVIEW

This section gives an overview of the method developed and applied in MapBiomas Soil. For more methodological details, see the ATBD (Algorithm Theoretical Basis Document).

1. PRESENTATION

MapBiomas Soil has developed the first beta collection of annual maps of soil organic carbon (SOC) stocks in Brazil, covering the period from 1985 to 2021. The maps represent the amount of carbon in the top soil layer, from the surface to a depth of 30 centimeters. This layer is extremely important because it is where the greatest interaction occurs between plant roots, the decomposition of organic matter, and soil formation. The maps were created using data from the SoilData Repository, along with dozens of environmental covariates that represent the soil-forming factors in Brazil. They have a spatial resolution of 30 meters and report soil organic carbon stocks in tons per hectare (t/ha).

The MapBiomas Soil initiative aims to unravel the evolution of soil resources over time and space by adopting an open and collaborative scientific approach. This beta collection is a first approximation of a spatio-temporal model of the Brazilian soil organic carbon stock, built on open soil data and up-to-date knowledge in digital soil mapping. The maps can be accessed on the MapBiomas platform at https://plataforma.brasil.mapbiomas.org/. The platform shows the total soil organic carbon stock in tons, as well as the stock per unit area in tons per hectare, for different territorial units.
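The relationship between the two quantities shown on the platform (stock per unit area and total stock) can be illustrated with a short calculation. The following is a minimal Python sketch, not the platform's implementation: it assumes every 30 m pixel covers exactly 900 m² (0.09 ha) and ignores projection effects. The function name and the pixel values are hypothetical.

```python
PIXEL_SIZE_M = 30.0   # spatial resolution stated for the maps
M2_PER_HA = 10_000.0  # square meters in one hectare


def total_stock_tonnes(stocks_t_per_ha, pixel_size_m=PIXEL_SIZE_M):
    """Sum per-pixel stocks (t/ha) into a total stock (t).

    Assumption: every pixel has the same nominal area
    (pixel_size_m squared), which is only approximately true.
    """
    pixel_area_ha = pixel_size_m ** 2 / M2_PER_HA  # 0.09 ha for 30 m pixels
    return sum(stocks_t_per_ha) * pixel_area_ha


# Made-up values for three pixels of a hypothetical territorial unit.
example_pixels = [50.0, 60.0, 40.0]
total = total_stock_tonnes(example_pixels)
print(f"Total stock: {total:.2f} t")
```

The mean of the per-pixel values gives the stock per unit area for the same unit, so both figures derive from the same map.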

MapBiomas Soil aims to provide valuable information about the dynamics of soil organic carbon in Brazil, supporting the understanding of soil processes and decision making related to the conservation and sustainable use of soil. Soil is a dynamic environment that is constantly changing, and our work is likewise under continuous improvement: we keep refining our estimates, expanding the database in the SoilData Repository, evaluating the maps with the collaboration of local experts, and incorporating new environmental variables. Our goal is to obtain increasingly useful and relevant results for understanding soil.

2. METHOD

The beta collection of maps was produced with machine learning algorithms that model the relationship between point data on soil organic carbon stock, their spatial (latitude and longitude) and temporal (year) coordinates, and environmental covariates that cover the whole Brazilian territory. These covariates represent the soil-forming factors, such as climate, organisms, relief, and parent material, and are obtained from free and open spatial databases. The soil data come from field soil sampling and are available in the SoilData Repository (https://soildata.mapbiomas.org/). All procedures are shown in the flow chart in Figure 1 and detailed in the following items.

2.1 Point Soil Data

Mapping soil organic carbon stock over time and space requires an extensive set of point data on soil properties such as organic carbon concentration, proportion of fine soil, density of fine soil, and thickness of the sampled layer. It is crucial to have a standardized set of field samples for this purpose. In this context, the SoilData Repository plays a key role as a centralized repository designed to store and provide open soil data for the scientific community.
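To show how these four properties combine into a stock value, the following is a minimal Python sketch of a commonly used textbook formula. It is not taken from the ATBD, which may define the terms or units differently. The assumed units are: carbon concentration in g/kg, density of fine soil in g/cm³, layer thickness in cm, and proportion of fine soil as a fraction between 0 and 1. The function name is hypothetical.

```python
def soc_stock_t_per_ha(carbon_g_per_kg, fine_density_g_per_cm3,
                       thickness_cm, fine_fraction=1.0):
    """Soil organic carbon stock of one layer, in t/ha.

    Assumed formula (common textbook form, not confirmed by the source):
        stock = C * density * thickness * fine_fraction / 10
    The factor 1/10 converts g/kg * g/cm3 * cm into t/ha.
    """
    if not 0.0 <= fine_fraction <= 1.0:
        raise ValueError("fine_fraction must be between 0 and 1")
    return (carbon_g_per_kg * fine_density_g_per_cm3
            * thickness_cm * fine_fraction / 10.0)


# Made-up sample: 20 g/kg carbon, 1.2 g/cm3 density, 0-30 cm layer,
# no coarse fragments.
stock = soc_stock_t_per_ha(20.0, 1.2, 30.0, fine_fraction=1.0)
print(f"SOC stock: {stock:.1f} t/ha")
```

Because all four inputs enter the product, a sample missing any one of them cannot be converted to a stock, which is one reason a standardized set of field samples matters.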

Currently, SoilData houses the largest collection of point data available for calculating organic carbon stocks in Brazil, as well as other information on Brazilian soil properties. The data can be accessed at soildata.mapbiomas.org. The repository gathers information from more than 20,000 soil samples from 247 data sets. These data are of paramount importance for producing and improving the MapBiomas Soil maps. By providing a comprehensive set of standardized data, SoilData enables a better understanding of soil properties and contributes to more efficient and sustainable research and practices.

From the data available in the SoilData Repository, only the samples that have temporal (year of collection) and spatial (latitude and longitude) information were considered for mapping. The resulting number of samples is illustrated in Figure 2. These samples were collected from the 1960s to the present. Although a considerable amount of data is available for mapping soil carbon stocks, its spatio-temporal distribution is heterogeneous (Figure 2). Most of the data come from soil samples collected between the 1970s and 1990s, largely from the RADAM project, and were made available by institutions such as EMBRAPA, IBGE, and universities. The distribution of samples per biome is also heterogeneous: the Amazon and Atlantic Forest biomes have the largest numbers of samples, with 4467 and 2890, respectively.

Figure 2: Spatial and temporal distribution of the field samples used for mapping soil organic carbon stocks. All samples have been standardized and are available for reuse in the SoilData Repository.

2.2 Environmental Covariates

Spatially explicit environmental covariates are used to predict soil organic carbon stocks at unsampled locations and times. These covariates represent the factors that influence soil formation, such as controls on pedogenesis, pedogenetic processes, and soil distribution in the landscape. Each factor affects the soil through a variety of physical, chemical, and biological processes that occur simultaneously over time.

To include environmental covariates in the spatio-temporal model, spatially explicit data from free and open spatial databases were used and treated as potential predictors. The covariates were added to the predictive model gradually in order to evaluate their influence on the resulting maps. In total, 43 covariates were included in this beta collection, selected based on the plausibility of their relationship to the spatio-temporal dynamics of carbon stocks.


The environmental covariates were used in the modeling according to the temporal coverage of the data. They can be static, i.e., without temporal reference, or dynamic, with annual data. The static variables include the morphometric characteristics of the relief, climate classification, biome, phytophysiognomies, and pre-existing maps of soil properties. Temporal carbon dynamics were modeled based on land use and land cover data from MapBiomas Collection 7.1, which were treated as dynamic covariates.


This approach of incorporating spatially explicit environmental covariates into the model helps to capture the complexity and variability of the factors that influence soil organic carbon stocks over time and space. This allows for a better understanding of the dynamics and changes in carbon stocks, contributing to informed decision making regarding soil management and conservation.

Table 1. Set of static and dynamic covariates used in modeling, according to the soil formation factor they represent.

| Predictive factor | Covariates | Variable type | Temporal dimension |
| Soil (s) | Probability of occurrence of soil classes or types: shallow, sandy, moist, humic, organic, rich in Fe and Al oxides, and black soil. Source: WRB/ISRIC, 2015 | continuous | static |
| Soil (s) | Soil properties: clay content, silt content, sand content, cation exchange capacity, pH in water, soil density, carbon content, total nitrogen, coarse fragments. Source: SoilGrids 2.0 | continuous | static |
| Soil (s) | Below-ground carbon, above-ground carbon, dead wood, litter, total carbon. Source: Fourth National Communication (QCN), 2020 | numeric | static |
| Soil (s) | Mineral spectral indices: iron oxides and clay minerals, derived from Landsat images | continuous | dynamic |
| Climate (c) | Köppen climate classification. Source: Alvares et al. (2013) | nominal categorical | static |
| Organisms (o) | Classification of primary vegetation: phytophysiognomy. Source: IBGE, 2021 | nominal categorical | static |
| Organisms (o) | Land cover and land use classification. Source: MapBiomas, Collection 7.1 | nominal categorical | dynamic |
| Organisms (o) | Territorial classification (biome). Source: IBGE, 2021 | nominal categorical | static |
| Organisms (o) | Vegetation indices (NDVI, EVI and SAVI) derived from Landsat images | continuous | dynamic |
| Relief (r) | Morphometric terrain properties: slope, Composite Topographic Index, roughness, convergence, profile curvature, Digital Elevation Model 30m, north and east exposure indices, stream power index. Source: Geomorpho 90m, 2020 | continuous | static |
| Age (a) | Age of land cover type and land use. Source: MapBiomas, Collection 7.1 | continuous | dynamic |
| Spatial position (n) | Geographic coordinates (lat/long) | continuous | static |

2.3 Predictive Model

The method used to predict soil organic carbon stock was Random Forest, a regression technique based on a collection of randomized regression trees. To fit the model, two main hyperparameters were defined: the number of regression trees to be fitted (ntree) and the number of predictor covariates considered at each split (mtry).

For these maps, ntree was set to 1/10 of the total number of available samples, and mtry was set to 1/3 of the total number of covariates. The model was trained using the ee.Classifier.smileRandomForest function available on the Google Earth Engine (GEE) platform, with the output mode set to "Regression" to predict continuous values. Training used soil organic carbon stock data from all available observations, along with the year of sample collection and the environmental covariates, so that a single predictive model was created. This model was then used to make spatial predictions and temporal extrapolations: it was applied to each year of interest (1985-2021) using the environmental covariates relevant to that year. This yielded estimates of soil organic carbon stock over time and space, providing a comprehensive view of SOC dynamics.
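The two hyperparameter rules can be written out as a short calculation. The following is a minimal Python sketch under stated assumptions: the text does not say how fractional values are rounded, so rounding down is an assumption, and the sample count of 10,000 is a hypothetical figure used only for illustration. The function name is also hypothetical.

```python
def random_forest_hyperparameters(n_samples, n_covariates):
    """Derive (ntree, mtry) from the rules described in the text.

    ntree = 1/10 of the number of samples
    mtry  = 1/3 of the number of covariates
    Assumption: fractional results are rounded down, with a minimum of 1.
    """
    ntree = max(1, n_samples // 10)
    mtry = max(1, n_covariates // 3)
    return ntree, mtry


# 43 covariates as stated for the beta collection; 10,000 samples is a
# hypothetical count.
ntree, mtry = random_forest_hyperparameters(n_samples=10_000, n_covariates=43)
print(ntree, mtry)

# In the Earth Engine API these values correspond to the numberOfTrees and
# variablesPerSplit arguments of ee.Classifier.smileRandomForest, and the
# regression output is selected with .setOutputMode('REGRESSION').
```

Tying ntree to the sample count means the forest grows as more samples are added to the repository, while mtry stays fixed as long as the covariate set does.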

2.4 Validation

The model in MapBiomas Soil was evaluated using cross-validation and spatial cross-validation. In the cross-validation, the samples were randomly separated into 10 sets (10-fold cross-validation). In each iteration, one set was used to validate the model, while the other nine sets were used to train it. Several performance metrics were calculated during cross-validation to evaluate the accuracy of the model, including Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Nash-Sutcliffe model efficiency (NSE). These metrics were calculated for each biome individually, making it possible to evaluate model performance in different regions and environmental conditions.
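The random fold assignment and the four metrics can be sketched in plain Python. This is a minimal illustration, not the project's validation code: it covers only random 10-fold splitting (not the spatial variant), the function names are hypothetical, and the observed and predicted values are made up.

```python
import math
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly split sample indices into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def regression_metrics(observed, predicted):
    """Return MAE, MSE, RMSE and NSE for paired observations."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_obs = sum(observed) / n
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    # NSE: 1 is a perfect fit; 0 means no better than the observed mean.
    nse = 1.0 - sum(e * e for e in errors) / ss_total
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "NSE": nse}


folds = k_fold_indices(n_samples=25, k=10)
for held_out in folds:
    training = [i for fold in folds if fold is not held_out for i in fold]
    # A model would be fitted on `training` and evaluated on `held_out`.

# Made-up observed and predicted stocks (t/ha) for one held-out fold.
metrics = regression_metrics([40.0, 50.0, 60.0], [42.0, 50.0, 57.0])
print(metrics)
```

To report results per biome, the same metric function would be applied separately to the held-out predictions of each biome.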