Abstract
Firms engaged in producing, processing, marketing, or using lumber and lumber products often invest in futures markets to reduce the risk of lumber price volatility. The accurate prediction of real-time prices can help companies and investors hedge risks and make correct market decisions. This paper explores whether Internet browsing habits can accurately nowcast the lumber futures price. The predictors are Google Trends index data related to lumber prices. This study offers a fresh perspective on nowcasting the lumber price accurately. Employing both machine learning and deep learning methods, it shows that although both have high predictive power, deep learning models on average capture trends better and provide more accurate predictions than machine learning models. The artificial neural network model is the most competitive, followed by the recurrent neural network model.
Lumber futures have been traded at the Chicago Mercantile Exchange since 1969 (Mehrotra and Carter 2017). Since the COVID-19 pandemic, the lumber futures price has experienced huge volatility. Figure 1 plots the daily opening price of lumber futures from May 3, 2011, to May 28, 2021. The opening price of lumber futures plummeted on April 1, 2020, and then returned to the normal levels seen prior to the pandemic. After that, it continued to climb steeply and finally reached its highest point in 10 years on May 7, 2021, at $1,677 per thousand board feet (mbf). It was $425.9 per mbf on January 21, 2020, when the first COVID-19 case in the United States was confirmed (Sahu and Kumar 2020). The average opening price from May 2011 to January 2020 was $337 per mbf, while the average opening price from February 2020 to May 2021 was $698 per mbf. These unusual fluctuations exposed lumber futures products, which were originally designed to hedge uncertainty, to huge risks. Therefore, there is an urgent need to find a reliable method to predict the lumber futures price, which would help enterprises and investors hedge risks and make correct decisions in the market.
In recent decades, several lumber price prediction methods have been proposed, such as ordinary least-squares regression (Mehrotra and Carter 2017), the vector autoregressive model (VAR) (Song 2006), the autoregressive integrated moving average model (ARIMA) (Buongiorno and Balsiger 1977, Oliveira et al. 1977, Banaś and Utnik-Banaś 2021), the seasonal autoregressive integrated moving average model (SARIMA) (Banaś and Utnik-Banaś 2021), the seasonal autoregressive integrated moving average model with exogenous variables (SARIMAX) (Banaś and Utnik-Banaś 2021), the forest simulation model (FORSIM) (Buongiorno et al. 1984), and the sales & operations planning network model (Marier et al. 2014). Most of the literature on lumber price prediction is based on traditional statistical models (Marier et al. 2014), econometric models (Banaś and Utnik-Banaś 2021, Buongiorno and Balsiger 1977, Mehrotra and Carter 2017, Oliveira et al. 1977, Song 2006), or mathematical models (Buongiorno et al. 1984). So far, only one paper has used a recurrent neural network model, a deep learning method, to predict the closing price of lumber futures in the next few days using the prices from the previous few days (Verly Lopes et al. 2021).
In other domains, machine learning models and deep learning models were widely used for time series forecasting. A support vector machine (SVM) method was employed to forecast the daily electrical load (Singh and Mohapatra 2021) or wind speed (Gangwar et al. 2020). A random forest method was conducted in other studies to estimate poverty (Zhao et al. 2019) or the biomass weight of wheat (Zhou et al. 2016). XGBoost was run to forecast crude oil price (Gumus and Kiran 2017) or sales of the enterprise (Gurnani et al. 2017, Ji et al. 2019). Classification and regression tree (CART) was carried out to forecast precipitation (Choubin et al. 2018) or currency exchange rate (Haeri et al. 2015). And the deep learning models, including artificial neural network (ANN), recurrent neural network (RNN), and convolutional neural network (CNN), were applied to forecast construction material prices (Mir et al. 2021), photovoltaic power (Abdel-Nasser and Mahmoud 2019), gas demand (Su et al. 2019a), stock markets (Hoseinzade and Haratizadeh 2019), or river discharges (Awchi 2014). Overall, machine learning models and deep learning models have been widely employed to predict economic indicators, socioeconomic indicators, and science indicators. Machine learning models and deep learning models are statistical approaches. Compared to the traditional econometric models, they capture the hidden nonlinear characteristics among variables and provide more accurate predictions, while the econometric models are based on strict linear assumptions (Herrera et al. 2019) and might overfit the sample and yield forecasting error (Shobana and Umamaheswari 2021).
Some previous studies predicted the future lumber price based on the past values, which is an autoregressive technique (Song 2006). Other studies use some exogenous independent variables to predict lumber prices, such as the construction confidence index (Banaś and Utnik-Banaś 2021) and specific characteristics of the lumber supply chain (Marier et al. 2014). Models that include exogenous independent variables can produce good prediction results because the exogenous variables normally contain more information. However, none of these studies included public attention as an exogenous variable. Google is the most popular search engine in the United States. Google Trends is a publicly available service provided by Google. It provides access to aggregated information about different search queries and how those queries change over time. The Google Trends index is an index measuring the search volume of different queries over time. Users can use the Google Trends index to observe changes in the query volume of certain keywords over time and compare the query volume of different keywords over time. This provides an opportunity to capture the interest and concern of the public in real time without any cost. Therefore, Google Trends index is widely used to predict economic indicators and socioeconomic indicators, such as sales, unemployment, travel, consumer confidence (Choi and Varian 2012), consumer behavior (Carrière-Swallow and Labbé 2013), housing market (Dietzel 2016), the stock price (Hu et al. 2018), and so on.
This prospective study aims to use the Google Trends index of some keywords from the previous day to predict the next day's opening price of lumber futures. Nowcasting is the process of predicting the present, the very near future, or the very recent past value of an indicator based on real-time data (Banbura et al. 2010, Chumnumpan and Shi 2019). Nowcasting the opening price of lumber futures can help investors take appropriate actions during the premarket trading hours between 8:00 a.m. and 9:30 a.m. Eastern each trading day. It would have a beneficial impact on hedging risks and expanding trade opportunities (Dungey et al. 2009). It would also help enterprises navigate both normal times and unusual times such as a pandemic. The statistical significance of the keywords of the Google Trends index will change over time. In other words, different factors have different effects on the lumber futures price in different situations. The models can dynamically select the keyword variables in different time periods. As a result, the set of variables will change to capture the dynamic trends of the real world. This study fills the gap in the literature by using machine learning and deep learning models to nowcast the lumber futures price via the Google Trends index.
This paper consists of five sections. The “Data” section briefly introduces the data. The “Prediction Models” section describes the models adopted in this study. The “Results and Discussion” section presents and discusses the results, and the “Conclusion” section concludes this study.
Data
Data collection
The Chicago Mercantile Exchange lumber futures price daily data were extracted from Investing.com. The dataset includes opening price, closing price, highest price, and lowest price of lumber futures. The data are from May 2011 to May 2021, with a total of 2,523 entries of data. The opening price of lumber futures is plotted in Figure 1.
The actual Google search requests for some lumber price–related keywords were then extracted from the Google Trends index to match the time series of the lumber price data set. Keyword variables include 2 by 4 (a length of sawn wood 2 inches thick and 4 inches wide), BDFT (board foot), CLT (cross-laminated timber), commodity, DIY (do it yourself), fire, forest products association, forestry, hardwood, harvest, home building, home improvement, home renovation, invest, logging, logs, lumber futures, lumber price, lumber yard, MDF (medium density fiberboard), OSB (oriented strand board), plywood, sawmill, softwood, stock market, timber, and wood. Research has shown that a reduction in the quality of softwood lumber, or of any lumber for that matter, affects lumber prices. Hence, more general keywords were included instead of specific kinds of lumber. For example, Southern pine and Douglas-fir lumber, which are the two most commercially important types of softwood lumber, have not changed in strength and stiffness over the last five decades (Miyamoto et al. 2018, França et al. 2021, Shmulsky et al. 2021), and thus they were not included in the keywords.
The Google Trends index standardizes the data to a scale of 0 to 100 to represent the “interest over time.” However, the scale of a series will change if the same variable is colisted with other keywords or if the time range is changed. Therefore, it is important to always extract the same combination of words over the same time range during the modeling and prediction process to avoid restandardizing the same data set to different scales. Many years of daily keyword data cannot be downloaded directly from Google Trends. To avoid restandardization, an application programming interface (API) was used to extract the Google Trends index data via R. The R library “gtrendsR” was employed to extract the Google Trends index, and it retrieves the index via APIs. The descriptive statistics of the opening price and closing price of lumber futures and of the Google Trends index of all keywords are provided in Table 1.
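The restandardization problem can be illustrated with a small sketch. It assumes, purely as a simplification, that the index is the raw search interest divided by the maximum within the requested window and scaled to 0 to 100; the exact computation used by Google Trends is not described in this paper, and the function name `to_trends_index` and the raw values are hypothetical.

```python
def to_trends_index(raw_interest):
    """Rescale raw interest so the window maximum maps to 100.

    Simplified assumption: the real Google Trends scaling is not
    documented here and may differ.
    """
    peak = max(raw_interest)
    if peak == 0:
        return [0 for _ in raw_interest]
    return [round(100 * value / peak) for value in raw_interest]


# Hypothetical raw search interest for one keyword over five days.
raw = [20, 40, 16, 80, 60]

full_window = to_trends_index(raw)        # all five days requested together
short_window = to_trends_index(raw[:3])   # only the first three days requested

print(full_window)
print(short_window)
```

The first three days receive different index values in the two requests even though the underlying interest is identical, which is why the same keyword combination and time range should be used throughout.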
Variable selection
To increase model interpretability, remove redundant or irrelevant variables, and reduce overfitting, the least absolute shrinkage and selection operator (LASSO) was first applied to perform independent variable selection (Fonti and Belitser 2017). The LASSO estimate can be written as
$$\hat{\beta}^{lasso} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}$$
where $y_i$ is the response for observation $i$, $x_{ij}$ is the value of predictor $j$ for observation $i$, $\beta_0$ and $\beta_j$ are the coefficients, and $\lambda \geq 0$ is a constant parameter that controls the strength of regularization. The larger the value of $\lambda$, the greater the amount of regularization (Muthukrishnan and Rohini 2016, Fonti and Belitser 2017). In the LASSO process, the variables that have nonzero coefficients after the regularization are selected as part of the model (Fonti and Belitser 2017). As a result, the lumber futures closing price and the Google Trends index of the four terms “2 by 4,” “commodity,” “invest,” and “lumber futures” were selected as the feature inputs of the models (Table 2). Figure 2 plots the daily Google Trends index of the above keywords from May 3, 2011, to May 28, 2021.
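How the penalty drives some coefficients exactly to zero can be sketched with coordinate descent on a toy data set. This is a minimal illustration, not the procedure used in the study: it assumes centered predictors and response (so no intercept is fitted), minimizes the residual sum of squares plus λ times the sum of absolute coefficients, and uses hypothetical data and function names.

```python
def soft_threshold(rho, threshold):
    """Shrink rho toward zero by threshold; return 0.0 inside the band."""
    if rho > threshold:
        return rho - threshold
    if rho < -threshold:
        return rho + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize sum((y - X b)^2) + lam * sum(|b|) by cyclic coordinate descent.

    Assumes the columns of X and the vector y are centered and that no
    column is identically zero.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta


# Hypothetical centered data: y depends strongly on feature 0, barely on feature 1.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-5.9, -3.1, 0.0, 2.9, 6.1]

beta = lasso_coordinate_descent(X, y, lam=2.0)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(beta, selected)
```

Only the features whose coefficients remain nonzero (here, the indices in `selected`) would be carried forward as model inputs.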
Sample splitting
Before building the models, the data set was divided into two subsets, a training set and a test set, which helps guard against overfitting and improve the accuracy of the models (LeCun et al. 2015, Roelofs et al. 2019). The models were trained on the training set, and the fitted models were used to estimate the predicted values in the test set, which provides an evaluation of the models. The splitting rate of a data set is selected with respect to the characteristics of the studied subjects (Tao et al. 2020, Nguyen et al. 2021) and the sample size (Tai et al. 2019). In this study, considering that the lumber price did not fluctuate abnormally until the second half of 2020 and that there are thousands of observations, the splitting rate was set to 95 percent. The training set and the test set contain 95 and 5 percent of the total sample, respectively: the data from the first nine and a half years (May 3, 2011, to November 24, 2020) were used as the training set, and the data from the last six months (November 25, 2020, to May 28, 2021) were used as the test set.
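A chronological split of this kind can be sketched as follows. The helper name `chronological_split` and the rounding rule (truncating the cut point to an integer) are assumptions for illustration; the placeholder list simply stands in for the 2,523 daily entries in date order.

```python
def chronological_split(rows, train_fraction=0.95):
    """Split time-ordered rows into an early training part and a late test part.

    No shuffling is done, so every test observation is later in time than
    every training observation.
    """
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Placeholder for 2,523 daily observations sorted by date.
observations = list(range(2523))
train, test = chronological_split(observations)
print(len(train), len(test))
```

Under these assumptions the test part holds 127 entries, which agrees with the number of test observations reported in the results.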
Prediction Models
Machine learning (ML) models and deep learning (DL) models have emerged with the advent of big data technology and gained popularity as frontier prediction methods (Liakos et al. 2018). Machine learning models are algorithms that give machines the ability to optimize performance without being explicitly programmed (Schmidt et al. 2019, Kadam et al. 2020). Machine learning models include support vector machine (SVM), random forest, XGBoost, classification and regression trees (CART), and many more (Friedman et al. 2001). Deep learning models are defined as representation-learning algorithms composed of processing units organized in input layers, hidden layers, and output layers (LeCun et al. 2015, Shrestha and Mahmood 2019). Deep learning models include artificial neural network (ANN), recurrent neural network (RNN), and convolutional neural network (CNN) (Miotto et al. 2018).
Machine learning models
Support vector machine.—
Support vector machine is an algorithm that maximizes a specific mathematical function based on a given data set (Noble 2006). SVM can be applied to time series prediction by introducing kernel functions (Pyo et al. 2017). In the SVM, the input vector $x$ is mapped to a high-dimensional feature space using the nonlinear mapping function $\Phi(x)$, and the regression is run in that space (Wang et al. 2008). The SVM can be represented as the following equation:
$$\hat{y} = f(x) = \sum_{i=1}^{N} w_i \Phi_i(x) + b \tag{2}$$
where $\hat{y}$ is the predicted value. The parameters $b$ and $w_i$ can be estimated by minimizing the regularized risk function:
$$R(C) = C \frac{1}{N} \sum_{i=1}^{N} L_{\varepsilon}(y_i, \hat{y}_i) + \frac{1}{2} \lVert w \rVert^2$$
where $C$ is a regularization constant, $y$ is the actual value, $L_{\varepsilon}$ is the loss function, and $\frac{1}{2} \lVert w \rVert^2$ is a measurement of function flatness. By introducing the kernel function $K(x, x_i)$, Equation 2 can be transformed into the explicit form:
$$f(x, \alpha_i, \alpha_i^*) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) K(x, x_i) + b$$
where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers, which satisfy the conditions $\alpha_i \times \alpha_i^* = 0$, $\alpha_i \geq 0$, and $\alpha_i^* \geq 0$ (Choudhry and Garg 2008, Wang et al. 2008). In this study, $K(x, x_i)$ is the polynomial kernel function:
$$K(x, x_i) = (x \cdot x_i + 1)^d$$
where $x_i$ is a sample in the training set and $d$ is the degree of the polynomial (Choudhry and Garg 2008).
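The kernel form of the SVM prediction can be sketched in a few lines. The sketch assumes a polynomial kernel of the form (x · xi + 1)^d and an ε-insensitive loss; the support vectors, the dual coefficients (αi − αi*), and all function names are hypothetical values chosen for illustration rather than fitted quantities from this study.

```python
def polynomial_kernel(x, x_i, degree=2, offset=1.0):
    """Polynomial kernel (x . x_i + offset) ** degree (assumed form)."""
    dot = sum(a * b for a, b in zip(x, x_i))
    return (dot + offset) ** degree


def epsilon_insensitive_loss(y, y_hat, epsilon=0.1):
    """Zero loss inside the epsilon tube, linear loss outside it."""
    return max(0.0, abs(y - y_hat) - epsilon)


def svr_predict(x, support_vectors, dual_coefs, b, degree=2):
    """Kernel expansion: sum_i (alpha_i - alpha_i*) K(x, x_i) + b."""
    return sum(
        coef * polynomial_kernel(x, sv, degree)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + b


# Hypothetical support vectors and dual coefficients (alpha_i - alpha_i*).
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
dual_coefs = [0.5, -0.25]
prediction = svr_predict([1.0, 1.0], support_vectors, dual_coefs, b=1.0)
print(prediction)
```

In a fitted model the dual coefficients and the bias come from minimizing the regularized risk; here they are fixed by hand only to show how the prediction is assembled.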
Random forest.—
Random forest is an algorithm that obtains its output by combining many decision trees to form a forest (Breiman 2001). Specifically, it draws a bootstrap sample from the training set, selected randomly with replacement, and then finds the optimal split point for dividing a node into two subtrees by minimizing the mean squared error (MSE); this is called growing a random forest tree, $T_m$. After $M$ trees are created, the final output of the random forest is defined as (Huang and Liu 2019, Peng et al. 2021, Yoon 2021):
$$\hat{f}(x) = \frac{1}{M} \sum_{m=1}^{M} T_m(x)$$
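The bootstrap-and-average idea can be sketched with very small trees. This is a simplified illustration under stated assumptions: each tree is a single-split regression stump on one predictor, whereas a full random forest grows deeper trees and also samples a random subset of predictors at each split. The data and function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing the squared error."""
    pairs = sorted(zip(xs, ys))
    mean_all = sum(ys) / len(ys)
    # (squared error, threshold, left mean, right mean); no split to start with.
    best = (sum((y - mean_all) ** 2 for y in ys), None, mean_all, mean_all)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        m_left, m_right = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - m_left) ** 2 for y in left)
        sse += sum((y - m_right) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, (pairs[k - 1][0] + pairs[k][0]) / 2, m_left, m_right)
    _, threshold, m_left, m_right = best
    if threshold is None:
        return lambda x: m_left
    return lambda x: m_left if x <= threshold else m_right


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Grow each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def forest_predict(trees, x):
    """Average the predictions of all trees."""
    return sum(tree(x) for tree in trees) / len(trees)


# Hypothetical step-shaped data.
xs = [i / 20 for i in range(20)]
ys = [0.0 if x < 0.5 else 10.0 for x in xs]
trees = fit_forest(xs, ys)
low, high = forest_predict(trees, 0.1), forest_predict(trees, 0.9)
print(low, high)
```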
XGBoost.—
XGBoost, also called extreme gradient boosting, is a regression tree algorithm. It is based on the gradient boosting decision tree algorithm and adds regularization terms to control the complexity of the model, which can prevent overfitting and improve accuracy (Peng et al. 2019). As a result, the objective function consists of two parts, the training loss $L(\theta)$ and the regularization $\Omega(\theta)$:
$$obj(\theta) = L(\theta) + \Omega(\theta)$$
where $\theta$ is the parameter (Gurnani et al. 2017, Peng et al. 2019). The training loss is defined as:
$$L(\theta) = \sum_{i} (y_i - \hat{y}_i)^2$$
where $y_i$ is the actual value. In XGBoost, each inner node represents the value of the attribute test, and each leaf node with values represents a decision (Xie and Zhang 2021). $\hat{y}_i$ is the output, which is the sum of the predicted values from all $M$ trees and can be written in the form:
$$\hat{y}_i = \sum_{m=1}^{M} f_m(x_i), \quad f_m \in F$$
where $M$ is the number of trees, $x_i$ is the $i$th training sample, and $f_m$ is the function for the $m$th tree in the functional space $F$ (Peng et al. 2019, Xie and Zhang 2021).
The target function can be finally expressed as:
$$obj(\theta) = \sum_{i} (y_i - \hat{y}_i)^2 + \sum_{m=1}^{M} \Omega(f_m)$$
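The additive prediction and the regularized objective can be illustrated numerically. The sketch assumes a squared-error training loss and the per-tree penalty γT + ½λΣw² (T leaves with weights w) described in the XGBoost documentation; that penalty form is not stated in this paper, and the two hand-written trees, the data, and the function names are hypothetical.

```python
def ensemble_predict(trees, x):
    """Additive model: the prediction is the sum of all tree outputs."""
    return sum(tree(x) for tree in trees)


def tree_penalty(leaf_weights, gamma=1.0, lam=1.0):
    """Assumed complexity penalty: gamma * (number of leaves) + 0.5 * lam * sum(w^2)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w ** 2 for w in leaf_weights)


def objective(trees, leaf_weights_per_tree, xs, ys, gamma=1.0, lam=1.0):
    """Squared-error training loss plus the summed penalties of all trees."""
    loss = sum((y - ensemble_predict(trees, x)) ** 2 for x, y in zip(xs, ys))
    penalty = sum(tree_penalty(w, gamma, lam) for w in leaf_weights_per_tree)
    return loss + penalty


# Two hypothetical trees: one with two leaves, one with a single leaf.
trees = [lambda x: 2.0 if x < 1.0 else 4.0, lambda x: 0.5]
leaf_weights_per_tree = [[2.0, 4.0], [0.5]]
xs, ys = [0.0, 2.0], [3.0, 4.0]

value = objective(trees, leaf_weights_per_tree, xs, ys)
print(value)
```

Boosting adds trees one at a time, keeping a new tree only to the extent that the reduction in training loss outweighs the added penalty.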
Classification and regression trees.—
Classification and regression trees (CART) is a nonparametric statistical model that is employed for classification or regression problems. If the output variable is continuous, the CART model generates a regression tree. The CART tree is a hierarchical binary tree that is built by repeatedly splitting subsets of the data set, using all predictor variables, to generate two subnodes. To determine the split, each predictor is evaluated to find the best cut point based on the least-squares deviation (LSD) impurity measure, $R(t)$ (Mahjoobi and Etemad-Shahidi 2008, Samadi et al. 2014):
$$R(t) = \frac{1}{N_{\omega}(t)} \sum_{i \in t} \omega_i f_i \left( y_i - \bar{y}(t) \right)^2$$
where $N_{\omega}(t)$ is the weighted number of records at node $t$, $\omega_i$ is the value of the weighting field for record $i$, $f_i$ is the value of the repeat field, $y_i$ is the value of the target field, and $\bar{y}(t)$ is the mean of the output variable at node $t$.
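The impurity measure and the search for a cut point can be sketched as follows. The sketch assumes that the weighted number of records is the sum of ωi·fi over the node and that candidate cut points are midpoints between adjacent predictor values; the function names and the toy data are hypothetical.

```python
def lsd_impurity(ys, weights=None, freqs=None):
    """Least-squares deviation R(t) of the target values in one node."""
    n = len(ys)
    weights = weights or [1.0] * n
    freqs = freqs or [1.0] * n
    n_w = sum(w * f for w, f in zip(weights, freqs))
    mean = sum(w * f * y for w, f, y in zip(weights, freqs, ys)) / n_w
    return sum(w * f * (y - mean) ** 2 for w, f, y in zip(weights, freqs, ys)) / n_w


def best_split(xs, ys):
    """Return the cut point with the lowest size-weighted child impurity."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_threshold, best_score = None, lsd_impurity(ys)
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no cut point between identical predictor values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        score = (len(left) * lsd_impurity(left) + len(right) * lsd_impurity(right)) / n
        if score < best_score:
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best_score = score
    return best_threshold, best_score


# Hypothetical node with one predictor and a clear jump in the target.
threshold, score = best_split([1, 2, 3, 4], [1.0, 1.0, 5.0, 5.0])
print(threshold, score)
```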
Deep learning models
Artificial neural network model.—
The artificial neural network (ANN) model connects units called artificial neurons to generate complex networks (Kurbatsky et al. 2014, Su et al. 2019b). In each unit, there is an activation function, $f$, which is applied to the input variables, $x_i$, to generate the output value. The output of a unit is conveyed to the next unit as an input via a weighted connection. Given a unit, $j$, the output of this unit can be expressed as (Su et al. 2019b):
$$y_j = f_{ANN}\left( \sum_{i} \omega_{ij} x_i + t_j \right)$$
where $\omega_{ij}$ is the connection weight and $t_j$ is the bias term. The activation function, $f_{ANN}$, is the rectified linear unit activation function in this study. The ANN model in this study is composed of an input layer, seven hidden layers, and an output layer. The output layer sums up the output of units from hidden layers. Different values of the hyperparameters were tested, and the model with the best performance has a batch size of 8, 100 epochs, the Adam optimizer, a mean squared error loss function, and one hidden layer with 64 units in this study.
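The computation carried out by each unit can be sketched as a forward pass through one small hidden layer followed by a summing output unit. The weights, biases, and inputs below are hypothetical values chosen for illustration, not trained parameters, and the layer sizes are far smaller than those used in the study.

```python
def relu(z):
    """Rectified linear unit activation."""
    return max(0.0, z)


def dense_relu_layer(inputs, weights, biases):
    """Each unit j outputs relu(sum_i w_ij * x_i + t_j).

    weights[j] holds the connection weights into unit j.
    """
    return [
        relu(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


def linear_output(hidden, weights, bias):
    """Output unit: weighted sum of the hidden-unit outputs."""
    return sum(w * h for w, h in zip(weights, hidden)) + bias


# Hypothetical two-input network with two hidden units.
inputs = [1.0, 2.0]
hidden = dense_relu_layer(inputs, weights=[[0.5, -1.0], [1.0, 1.0]], biases=[0.0, -1.0])
output = linear_output(hidden, weights=[1.0, 0.5], bias=0.25)
print(hidden, output)
```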
Recurrent neural network.—
Recurrent neural network (RNN) is a type of neural network. It uses the previous values of observations to calculate the future value by connecting the computational units to form a directed cycle (Selvin et al. 2017, Moghar and Hamiche 2020). However, the RNN confronts two problems: vanishing gradients and exploding gradients (Bouktif et al. 2018). As a result, long short-term memory (LSTM) was introduced to solve these problems in this study. The usual hidden layers were replaced with LSTM cells. The LSTM cells consist of an input gate, a forget gate, an output gate, and a cell state, which makes it possible to control the gradient flow and thereby overcome the vanishing and exploding gradient problems (Selvin et al. 2017, Bouktif et al. 2018). The LSTM cell can be expressed as (Bouktif et al. 2020):
$$\begin{aligned}
f_t &= \sigma\left( W_f [h_{t-1}, x_t] + b_f \right) \\
i_t &= \sigma\left( W_i [h_{t-1}, x_t] + b_i \right) \\
o_t &= \sigma\left( W_o [h_{t-1}, x_t] + b_o \right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left( W_c [h_{t-1}, x_t] + b_c \right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$
where $x_t$ is the input vector at time $t$; $h_{t-1}$ and $h_t$ are the output vectors of the hidden units at time $t-1$ and time $t$, respectively; $f_t$, $i_t$, and $o_t$ are the forget, input, and output gate vectors, respectively; $c_t$ is the cell state vector; $W_*$ and $b_*$ are the weight matrices and bias vectors of the LSTM unit, respectively; $\sigma$ is the gate (recurrent) activation function; and $\odot$ denotes element-wise multiplication. In this study, the RNN model is composed of an LSTM layer with 500 units and has 50 epochs, a batch size of 9, the Adam optimizer, and a mean squared error loss function. The activation function and the recurrent activation function are the hyperbolic tangent activation function and the hard sigmoid activation function, respectively.
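One time step of an LSTM cell can be sketched with scalar states to keep the arithmetic visible. The sketch assumes the piecewise-linear hard sigmoid max(0, min(1, 0.2z + 0.5)) for the gates, which is one common definition and may differ from the one used by the software in this study; the parameter layout and all values are hypothetical.

```python
import math


def hard_sigmoid(z):
    """Piecewise-linear gate activation (assumed definition)."""
    return max(0.0, min(1.0, 0.2 * z + 0.5))


def lstm_step(x_t, h_prev, c_prev, params):
    """Advance a scalar LSTM cell by one time step.

    params maps each of "f", "i", "o", "c" to a tuple
    (weight on h_prev, weight on x_t, bias).
    """
    def pre_activation(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = hard_sigmoid(pre_activation("f"))   # forget gate
    i_t = hard_sigmoid(pre_activation("i"))   # input gate
    o_t = hard_sigmoid(pre_activation("o"))   # output gate
    candidate = math.tanh(pre_activation("c"))
    c_t = f_t * c_prev + i_t * candidate      # new cell state
    h_t = o_t * math.tanh(c_t)                # new hidden output
    return h_t, c_t


# With all-zero parameters every gate equals 0.5 and the candidate is 0.
zero_params = {name: (0.0, 0.0, 0.0) for name in ("f", "i", "o", "c")}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=1.0, params=zero_params)
print(h_t, c_t)
```

Because the cell state is updated additively through the forget and input gates, the gradient can flow across many time steps without being repeatedly squashed.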
Convolutional neural network.—
Convolutional neural network (CNN) is a class of feedforward neural networks that can be effectively applied in image recognition, natural language processing, and time series prediction (Lu et al. 2020). A CNN consists of convolution layers, pooling layers, and fully connected layers. It extracts data features via the convolution layer and connects the units locally using the pooling layer, which reduces redundant features (Chen et al. 2021). Then it converts the features from the previous layers to the final output using fully connected layers, which can be expressed as (Balaji et al. 2018):
$$v_i^{j} = f_{CNN}\left( \sum_{k} w_{ki}^{j} \, v_k^{j-1} \right)$$
where $v_i^{j}$ is the output value of unit $i$ at layer $j$, $v_k^{j-1}$ is the output value of unit $k$ at layer $j-1$, and $f_{CNN}$ is the activation function. In this study, the activation function of the CNN is the rectified linear unit activation function. $w_{ki}^{j}$ is the weight of the connection between unit $k$ at layer $j-1$ and unit $i$ at layer $j$. In this study, the data are convolved through a Conv-1D layer with 16 units and then passed through a max pooling layer. Next, the data are convolved through another Conv-1D layer with 32 units and then passed through a global max pooling layer. The CNN model has 1,500 epochs, the Adam optimizer, and a mean squared error loss function.
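The layer sequence described above (one-dimensional convolution, max pooling, and global max pooling) can be sketched on a short series. The sketch uses a single hand-picked filter with no bias, no padding, and a stride of one; the filter values, the pooling size, and the input series are hypothetical.

```python
def relu(z):
    """Rectified linear unit activation."""
    return max(0.0, z)


def conv1d(signal, kernel):
    """Slide the kernel over the signal (no padding, stride 1) and apply ReLU."""
    k = len(kernel)
    return [
        relu(sum(kernel[j] * signal[i + j] for j in range(k)))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(values, size=2):
    """Keep the maximum of each non-overlapping window of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def global_max_pool(values):
    """Reduce the whole feature map to its single largest value."""
    return max(values)


# Hypothetical input series and averaging filter.
series = [1.0, 3.0, 2.0, 5.0, 4.0]
feature_map = conv1d(series, kernel=[0.5, 0.5])
pooled = max_pool1d(feature_map)
summary = global_max_pool(feature_map)
print(feature_map, pooled, summary)
```

A convolution layer with several units applies several such filters in parallel, each producing its own feature map.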
Evaluation of Models
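To evaluate the performance of these models, the mean squared error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and symmetric mean absolute percentage error (SMAPE) were used as the criteria. The measures are as follows:
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
$$MAE = \frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$$
$$MAPE = \frac{100\%}{N} \sum_{i=1}^{N} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert$$
$$SMAPE = \frac{100\%}{N} \sum_{i=1}^{N} \frac{\lvert y_i - \hat{y}_i \rvert}{(\lvert y_i \rvert + \lvert \hat{y}_i \rvert)/2}$$
where $N$ is the number of training set samples or test set samples, $y_i$ is the actual value at time $i$, and $\hat{y}_i$ is the corresponding predicted value.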
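The four criteria can be computed directly from paired actual and predicted values. The sketch below assumes the common SMAPE variant whose denominator is the mean of the absolute actual and predicted values; other variants exist, and the example prices are hypothetical.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    """Symmetric MAPE, in percent, with the mean of |actual| and |predicted| below."""
    terms = (
        abs(a - p) / ((abs(a) + abs(p)) / 2.0)
        for a, p in zip(actual, predicted)
    )
    return 100.0 * sum(terms) / len(actual)


# Hypothetical actual and predicted prices.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mse(actual, predicted), mae(actual, predicted))
print(mape(actual, predicted), smape(actual, predicted))
```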
Results and Discussion
In this study, a baseline model was established, based on the naïve forecasting method, to provide the required point of comparison when evaluating all other models. Naïve forecasting is the method in which actual values in the last period are simply taken as predicted values in this period. In the baseline model, the opening price at the previous time step t − 1 was used as the predicted value at time step t.
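The naïve baseline amounts to shifting the price series by one step, as the following sketch shows. The opening prices are hypothetical, and the mean absolute error is used here only as an example of scoring the baseline.

```python
def naive_forecast(prices):
    """Pair each actual value at step t with the value at step t - 1 as its forecast."""
    actual = prices[1:]
    predicted = prices[:-1]
    return actual, predicted


# Hypothetical daily opening prices.
opens = [400.0, 410.0, 405.0, 420.0]
actual, predicted = naive_forecast(opens)
baseline_mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(predicted, baseline_mae)
```

Any model worth using should produce lower errors than this shifted series on the same test period.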
The prediction results of different models of the test set are shown in Figure 3, which contains 127 observations from November 25, 2020, to May 28, 2021. All four machine learning models and three deep learning models showed strong predictive ability because the predicted lumber prices are close to the actual prices.
Figure 3 shows that the random forest, XGBoost, CART, ANN, RNN, and CNN models can capture the trends and dynamics in the test set, while the SVM model fails to identify the pattern in the highest price interval, which makes the nowcasting less accurate. It should be noted that the actual lumber price in the test set is much higher than that in the training set. Most of the machine learning and deep learning models can still capture the trends and identify the pattern. This shows that the machine learning and deep learning models have the ability to extract hidden features among variables in high-dimensional and multivariate data sets in a complex and dynamic environment (Köksal et al. 2011, Wuest et al. 2016).
From the overall performance, the ANN model performs better than other models. There is a large overlap between predicted prices and actual prices, especially for the prediction of an abnormal trend of rapid growth from mid-March 2021 to early May 2021. Moreover, the ANN model provides significantly better predictions than the baseline model. Although the random forest, XGBoost, CART, and RNN models are inferior to ANN, the predicted prices of these models were highly consistent with the actual observations. SVM and CNN models have the weakest prediction effects among the machine learning and deep learning models, respectively, although predicted prices of these two models are also roughly close to the actual prices. The SVM model overestimates the lumber price from mid-March to early May significantly, and the CNN model cannot capture the trend of rapid growth very well, compared with the other two deep learning models. This result might be explained by the fact that the CNN model does not depend on any information from previous observations to make a prediction (Selvin et al. 2017).
Figure 4 compares the average prediction performance of the machine learning models, the deep learning models, and the baseline model. Comparing the predictive performance of all seven models shows that the ANN model performs the best overall. Its MSE, MAE, MAPE, and SMAPE on the test set are the lowest among these models. This may be explained by the good self-learning, self-adapting, and self-organizing ability of the ANN model, which can analyze the patterns and rules of observations through training (Su et al. 2019b). The RNN model has the second-best prediction performance, which could be attributed to the ability of the RNN to use information from previous lags to predict future values (Selvin et al. 2017). XGBoost gives more accurate predictions than the other machine learning models, and it is also the third-best model among all seven models. The ANN, RNN, XGBoost, random forest, CART, and CNN models provide more accurate results than the baseline model. In addition, the performance of the machine learning and deep learning models is generally better than that of traditional time series models. For example, Banaś and Utnik-Banaś (2021) forecasted round wood prices from 2019 Q1 to 2020 Q4 in Poland using ARIMA, SARIMA, and SARIMAX models, whose MAPEs were 2.57, 2.20, and 1.75 percent on average, respectively. All the models in this study except SVM perform better than the ARIMA model. The ANN, RNN, XGBoost, random forest, and CART models in this study are better than the SARIMA model, and the ANN and RNN are better than the SARIMAX model.
Figures 3 and 4 show that, compared with machine learning models, deep learning models are, on average, more capable of capturing the trends and providing more accurate predictions. This may result from the better ability of deep learning models to reduce overfitting, which can also be seen in Figure 4: the three deep learning models fit the training set worse than the machine learning models do.
Conclusions
This study describes a new approach for nowcasting the lumber futures price using Google Trends index through machine learning models (SVM, random forest, XGBoost, and CART) and deep learning models (ANN, RNN, and CNN). We show that deep learning models generally give more accurate predictions than machine learning models. Among the seven models, the ANN model provides the best performance, followed by the RNN model. The comparison with the baseline model shows that the random forest, XGBoost, CART, ANN, RNN, and CNN models provide more accurate predictions than the baseline model. Our findings also imply that the Google Trends index, which reflects the dynamic changes of the interest and attention from the public, can provide enough information to be good predictors in nowcasting lumber futures prices.
By using these prediction methods and the Google Trends index, investors can take appropriate measures to hedge risks and make profits during premarket trading hours. The high predictive power of this approach implies that big data models should be added to the toolbox of investors and policymakers to predict other economic variables. One possible criticism of applying these methods to predict the lumber futures price and then acting on the predictions is that doing so might increase the volatility of the lumber futures market and, in turn, invalidate the forecasts.
Contributor Notes
The authors are, respectively, Ph.D. Candidate, School of Forestry and Wildlife Sci., Auburn Univ., Auburn, Alabama (mzh0097@auburn.edu); Assistant Professor, Dept. of Agric. Economics & Rural Sociology, Auburn Univ., Auburn, Alabama (wenying.li@auburn.edu [corresponding author]); Regions Professor, Forest Products Development Center (FPDC), School of Forestry and Wildlife Sci., Auburn Univ., Auburn, Alabama (brianvia@auburn.edu); and Alumni Professor, School of Forestry and Wildlife Sci., Auburn Univ., Auburn, Alabama (Zhangy3@auburn.edu). This paper was received for publication in October 2021. Article no. 21-00061.