The wildfire data analysis journey began with a dataset comprising over 1200 points located in the California region. To uncover the correlation between land cover types and wildfire risk, we first needed to determine the wildfire risk for each point using the FEMA Wildfire Risk Map. This step involved entering each point's coordinates into the FEMA mapping tool to obtain its wildfire risk index. From this dataset, we categorized 325 points as very low risk, 350 as moderate risk, and 578 as very high risk for wildfires. These categories served as the strata for our stratified random sampling method, ensuring a representative sample from each risk category.
GLOBE Observers Land Cover Map
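Concretely, the strata counts could be tallied with a few lines of pandas; the file name and column names below are hypothetical placeholders for the actual spreadsheet, so this is only a sketch of the step:

```python
import pandas as pd

# Hypothetical file: one row per observation point, with its coordinates and
# the wildfire risk category read off the FEMA Wildfire Risk Map.
points = pd.read_csv("fema_risk_points.csv")   # columns: latitude, longitude, risk

# Tally how many points fall into each risk stratum
# (the blog reports 325 very low, 350 moderate, and 578 very high).
print(points["risk"].value_counts())
```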
Stratified Random Sampling
Stratified random sampling, unlike simple random sampling, involves dividing the population into homogeneous subgroups (strata) and then drawing a random sample from each subgroup. This approach captures key characteristics of the population, enhancing precision and reducing estimation error. In our analysis, the full dataset of 1253 observations was divided into three strata: very high, moderate, and very low wildfire risk. We selected 33 random points from each stratum; because every stratum contributed the same number of points regardless of its size, this is disproportionate (equal-allocation) stratified sampling. This design let us compare land cover types across the risk categories on an equal footing and ensured that each stratum was well represented.
Spreadsheet illustrating strata organization: red indicates very high wildfire risk, orange represents moderate risk, and yellow denotes very low risk. Points from each section were analyzed for land cover type.
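A minimal sketch of this equal-allocation draw, assuming the points live in a pandas DataFrame with a hypothetical `risk` column holding the three categories:

```python
import pandas as pd

# points: one row per observation, with a 'risk' column holding
# "very low", "moderate", or "very high" (hypothetical column name).
points = pd.read_csv("fema_risk_points.csv")

# Disproportionate (equal-allocation) stratified sampling:
# draw the same number of points from every stratum, regardless of stratum size.
sample = (
    points.groupby("risk", group_keys=False)
          .sample(n=33, random_state=42)
)
print(sample["risk"].value_counts())   # 33 points per stratum, 99 rows in total
```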
Land Cover Analysis
After selecting the sample points, the next step was to analyze the land cover at each location. Each row in the dataset contained a link to the source image for that point, which we reviewed manually to identify the land cover type. The land cover categories were urban, forest, wetland, grassland, and shrubland. This detailed review allowed us to connect specific land cover types with their corresponding wildfire risk levels, offering insight into how different environments influence fire susceptibility. The process also underscored the importance of manual verification for data accuracy and reliability.
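Because the land cover labels are categorical, one plausible way to prepare them for a model is one-hot encoding; the column names and toy values below are illustrative, not the actual dataset:

```python
import pandas as pd

# A few sampled points after manual image review (toy values).
sample = pd.DataFrame({
    "land_cover": ["forest", "urban", "shrubland", "grassland", "wetland"],
    "risk":       ["very high", "very low", "very high", "moderate", "very low"],
})

# One-hot encode the land cover type so it can be fed to a tree-based model.
features = pd.get_dummies(sample["land_cover"], prefix="lc")
print(features)
```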
Random Forest Regressor and Model Training
The core of our analysis was a Random Forest Regressor used to predict wildfire risk from land cover data and historical fire incidents. The Random Forest algorithm builds many decision trees on random (bootstrapped) subsets of the training data; each tree is fit independently, and their predictions are averaged, which reduces overfitting and variance. We trained the model on 80% of the data and reserved the remaining 20% for testing, allowing us to assess how well the model generalizes to unseen data. This ensemble approach improved predictive accuracy and helped identify the features that most influence wildfire risk.
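A minimal sketch of this training setup with scikit-learn; the toy feature matrix and the numeric encoding of the risk categories are assumptions, not the project's actual data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Toy stand-in data: rows are sampled points, columns are one-hot land cover
# features; y is the risk category encoded numerically
# (0 = very low, 1 = moderate, 2 = very high) -- the encoding is an assumption.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(99, 5))    # 99 sampled points, 5 land cover classes
y = rng.integers(0, 3, size=99)         # risk label for each point

# Hold out 20% of the data for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An ensemble of decision trees fit on bootstrap samples of the training data;
# averaging their predictions reduces overfitting and variance.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```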
Evaluation Metrics and Model Performance
Evaluating the Random Forest model's performance involved several key metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared (R²), and Mean Absolute Error (MAE). The model achieved reasonable values on each metric, reflecting moderate precision in estimating fire risk. The Random Forest Classifier's accuracy stood at 0.75, demonstrating solid predictive capability. These metrics highlighted the model's effectiveness while also pointing to areas for improvement: the lower the error values, the more closely predictions align with actual risk, which is what makes the model reliable in real-world applications.
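These metrics can be computed with scikit-learn; the toy values below only illustrate the calls, and the scores reported above come from the author's actual data:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, mean_absolute_error, mean_squared_error, r2_score
)

# Toy true vs. predicted risk values standing in for the held-out test set.
y_true = np.array([2, 0, 1, 2, 0, 1, 2, 0])
y_pred = np.array([1.8, 0.2, 1.1, 1.5, 0.0, 1.4, 2.0, 0.3])

mse = mean_squared_error(y_true, y_pred)     # Mean Squared Error
rmse = np.sqrt(mse)                          # Root Mean Squared Error
mae = mean_absolute_error(y_true, y_pred)    # Mean Absolute Error
r2 = r2_score(y_true, y_pred)                # R-squared
print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  R²={r2:.2f}")

# Accuracy applies to the Random Forest Classifier's discrete class predictions.
cls_true = np.array([2, 0, 1, 2])
cls_pred = np.array([2, 0, 2, 2])
print(accuracy_score(cls_true, cls_pred))    # 0.75 on this toy example
```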
Confidence Intervals and Implications
To quantify the uncertainty around the model's accuracy, I calculated a 95% confidence interval for the Random Forest Classifier's accuracy, which ranged from 60.22% to 89.78%. The interval is broad because of the small sample size of 33, and it makes clear that more data are needed to narrow it and improve precision. This is where citizen science comes in: additional observations collected by volunteers would directly improve the reliability of wildfire risk predictions and support ongoing data collection and model refinement.
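The reported bounds are consistent with a normal-approximation (Wald) interval for a proportion with accuracy p = 0.75 and n = 33; whether that exact method was used is an assumption, but the arithmetic lines up:

```python
import math

p = 0.75   # Random Forest Classifier accuracy
n = 33     # sample size cited above
z = 1.96   # critical value for a 95% confidence level

margin = z * math.sqrt(p * (1 - p) / n)
print(f"95% CI: {p - margin:.2%} to {p + margin:.2%}")
# prints roughly 60.23% to 89.77%, close to the reported 60.22%-89.78%
```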
In the End
Through this process, I learned a great deal about wildfire risk assessment and the power of statistical analysis. Working with stratified random sampling and the Random Forest algorithm taught me how to handle large datasets, apply complex models, and understand the results. This experience showed me the importance of accuracy and detail in data analysis. It was exciting to see how our work could help identify areas at high risk for wildfires. Overall, this project made me appreciate data-driven methods in environmental science and showed me how important good statistical analysis is in solving real-world problems.
About the author: Akshada is a rising senior from Edison, New Jersey. This virtual internship is part of a collaboration between the Institute for Global Environmental Strategies (IGES) and the NASA Texas Space Grant Consortium (TSGC) to extend the TSGC Summer Enhancement in Earth Science (SEES) internship to U.S. high school students (http://www.tsgc.utexas.edu/sees-internship/). This guest blog shares her experience in the 2024 NASA SEES Earth System Explorers virtual internship.