1. Introduction
In the oil and gas industry, reservoir evaluation is a key basis on which petroleum engineers make accurate decisions during extraction [1]. The well-production rate is one of the important parameters in reservoir evaluation; it supports reservoir modeling and numerical simulation and guides reservoir development strategies [2]. Well-production prediction currently relies on three families of methods: Arps decline curve prediction, machine learning methods, and combination model prediction [3]. The Arps decline curve prediction method can achieve visualization of prediction results through software [4]. Machine learning methods have excellent self-learning abilities and diverse algorithms and can efficiently process large-scale datasets, although their performance largely depends on the quality and quantity of data [5]. The combination model prediction method has a wide range of applicability, can reduce systematic errors, and has strong model stability [6]. Arps decline curve prediction includes exponential decline, hyperbolic decline, and harmonic decline [7]. You et al. proposed a method for predicting the economic recoverable reserves of SAGD units through the decline constant method based on the Arps exponential and hyperbolic declines, as well as a new approach for predicting technical recoverable reserves using the intercept divided by slope method [8]. Chen et al. conducted a comparative analysis of three models, Arps, SPED, and MFF, and evaluated the predictive performance of the decline curve model and of the combination of the decline curve with a data-driven neural network model [9]. Arps decline curve prediction requires a sufficiently long production history for the decline trend to be detected and is suited to constant-pressure production. Because it rests on simplified well assumptions, it cannot accurately estimate actual well production [10].
Machine learning is an important branch of artificial intelligence (AI) that aims to enable computer systems to acquire knowledge and experience from data through learning and automated reasoning, and to use this knowledge and experience for pattern recognition, prediction, and decision-making [11]. Machine learning contains a large number of algorithms and models. For well-production prediction, machine learning methods learn a mapping between independent variables (geological parameters, production time, etc.) and dependent variables (production) [12]. Hou et al. applied machine learning methods to predict the porosity and permeability of reservoirs. Their research showed that logging parameters have a significant impact on the prediction results for porosity and permeability, and that an optimal adaptive model can be selected [13]. In recent decades, machine learning algorithms have been used increasingly often to predict well production. Among the numerous algorithms in machine learning, the artificial neural network (ANN) stands out because of its unique structure and learning ability, making it an effective means of solving complex problems [14]. Owing to the strong nonlinear fitting ability of ANN, its prediction results are usually more accurate than those of traditional linear regression or time series analysis [15]. In production prediction, ANN can automatically learn patterns from historical data and predict future yields based on these patterns [16]. Amr et al. used machine learning methods to train oil well productivity models for multiple blocks simultaneously. The monthly oil production was designated as the dependent variable of the model, which improved the accuracy of the prediction. They improved the robustness of the model by increasing the amount of data and studied the impact of the input variables on prediction accuracy [11]. ANN is therefore well suited to production prediction.
In production prediction, other machine learning algorithms also play an important role alongside ANN. Chakra et al. constructed an oil and gas cumulative production prediction model based on high-order neural networks (HONN). HONN can effectively express the linear and nonlinear relationships between input variables. The model was trained using data from a sandstone reservoir in India, and the trained model was used to predict the cumulative production of oil and gas. The research shows that even when on-site data are relatively insufficient, the model still exhibits good predictive ability and high accuracy [17]. Aizenberg et al. established an oil and gas production prediction model based on a complex-valued neural network with multi-valued neurons (MLMVN) that can accurately achieve the dynamic prediction of oil and gas field production [18]. Shoeibi et al. used long short-term memory (LSTM) networks to predict oil and gas production capacity and validated the effectiveness of the model through multiple numerical simulations and field data. The results indicate that the model exhibits good predictive performance in both stable and dynamic production modes [19]. Lolon et al. used a multivariate statistical model to evaluate the relationship between well parameters and production in the Bakken and Three Forks formations. The study found that in tight reservoirs, the total fracturing fluid volume and the proppant amount used during hydraulic fracturing are the main engineering parameters that affect production [20]. Song et al. conducted a nonlinear analysis of multiple factors using the random forest method for an ultra-low-permeability reservoir, identifying the main factors influencing the initial production capacity of the reservoir. Using the random forest method, they overcame the limitation of the gray wolf algorithm, which can only identify a single influencing factor. They combined the random forest and gray wolf algorithms to predict and optimize the production capacity of oil wells, yielding a prediction model that trains quickly [21]. Zhang et al. proposed an unconventional oil and gas production prediction model based on locality preserving projection (LPP). This model can accurately capture the nonlinear characteristics of various parameters. Parameter testing and comparative experiments showed that the LPP-based model has good adaptability and effectiveness [22]. Cao et al. used machine learning algorithms to predict production capacity by combining geological data, historical production capacity, and pressure data. They used a multi-layer perceptron to complete two tasks: predicting the future production capacity of existing oil wells based on historical data, and predicting the production capacity of new wells based on the historical production of adjacent wells under similar geological conditions [15]. Wang et al. proposed a method that combines multi-layer perceptron (MLP) and long short-term memory (LSTM) networks to predict shale gas production based on geological and fracturing reservoir parameters, historical data, and other information. On the basis of reservoir numerical simulation, they constructed a dataset and trained the model to achieve high accuracy [23]. Rober S et al. developed a complete well-completion design optimization and rapid economic benefit evaluation process based on an ANN. For data processing, they used a self-organizing feature map (SOM) clustering neural network to associate and classify data with reservoir types and completion features [24]. Dong et al. used the XGBoost algorithm to predict the initial production capacity of sandstone reservoirs and improved the model loss function, strengthening the physical constraints of the data mining algorithm and significantly improving the prediction accuracy [25]. Khan et al. optimized the fracturing design for heterogeneous oil reservoirs and evaluated its accuracy using various machine learning methods. They found that the XGBoost model performed well in productivity prediction and significantly improved prediction performance [26].
Building on machine learning, the combination model prediction method combines the advantages of multiple prediction methods and can combine different machine learning algorithms to improve the accuracy and stability of predictions [11]. Berneti et al. combined the imperialist competitive algorithm (ICA) with an artificial neural network (ANN) to develop a model for predicting oil well productivity. The model combines the local search capability of the BP algorithm with the global search capability of the ICA. Based on the production data of 31 wells, the researchers used the ICA-ANN prediction model to forecast the production capacity and compared it with the ANN prediction results. The results show that the ICA-ANN model performs well [27]. Vyas et al. used machine learning to link the decline curve model with well-completion parameters. They proposed an evaluation criterion that combines the best decline curve and accurate EUR prediction with machine learning, providing a new method for predicting oil well productivity [28]. Noshi et al. showed that joint decision-making based on multiple decision trees can effectively handle multidimensional data and is strongly resistant to overfitting [29]. Adesina et al. significantly improved the processing efficiency of nonlinear data by mapping the data to a high-dimensional space using the kernel function method [30]. Because individual machine learning algorithms carry certain errors in yield prediction, the combination model prediction method provides a new approach for improving accuracy. AutoGluon is an open-source framework launched by Amazon with automated machine learning (AutoML) capabilities, aimed at simplifying the training and deployment of machine learning models [31]. AutoGluon uses the available computing resources to find the strongest models within its allocated runtime and provides a user-friendly interface and documentation support. Erickson et al. found that combining many models in multiple layers makes better use of the allocated training time than searching for the single best model. Tests on 50 classification and regression tasks from Kaggle and the OpenML AutoML Benchmark showed that AutoGluon is faster, more robust, and more accurate than the AutoML platforms it was compared against [32].
However, current research on the main control factors of production capacity in intelligent production forecasting models has significant limitations. Most studies analyze engineering or geological parameters separately and fail to combine the two. Such single-factor analysis makes it difficult to fully reflect the production dynamics of oil and gas reservoirs, which reduces the accuracy of production capacity prediction. There is therefore an urgent need for an intelligent prediction model that considers both geological and engineering factors. In addition, most current capacity prediction models are based on a single model. AutoGluon can automatically combine multiple models through ensemble learning, drawing on the advantages of each model, reducing the limitations of any single model, and enhancing the stability of the trained model.
3. Case Study
This study employs ensemble techniques from hybrid machine learning methods to enhance oil well-production forecasting. Initially, various outlier detection and missing value imputation methods are applied to analyze and preprocess geological and engineering data from oilfield sites, removing outliers and filling in missing values. After data governance and target parameter selection, a relationship analysis is conducted among the parameters to filter out those with high correlations with the target parameters. The study utilizes gray relational analysis, the maximum mutual information method, AutoGluon feature importance evaluation, and SHAP value analysis, combined with the entropy weight method for comprehensive evaluation, to determine appropriate input parameters. Finally, the AutoGluon framework, which integrates multiple models, is used for target output prediction, with RMSE employed to validate the model performance, while MSE, MAE, and R² are used to assess the model's performance on both the training and testing datasets. All data have been desensitized for confidentiality.
3.1. Abnormal Data Situation
The on-site data are first normalized, scaling every parameter uniformly to the range [0, 1], so that their distributions can be compared visually in a box plot. A box plot is a statistical chart designed to display the dispersion of a dataset, primarily reflecting the characteristics of the original data distribution and allowing the distributional characteristics of multiple datasets to be compared. In the box plot, the middle line represents the median, while the upper and lower boundaries of the box represent the upper and lower quartiles (75% and 25%, respectively). The whisker ends mark the outlier cutoff points, which typically lie 1.5 times the interquartile range (IQR) beyond the box. Data points inside these limits are treated as normal, and data points beyond them are regarded as outliers (IQR method). The box plot of the normalized data is shown in Figure 2.
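The scaling to [0, 1] described above can be sketched as a min-max transformation. This is a minimal illustration, assuming min-max scaling is the normalization meant; the function name and the sample values are hypothetical and are not taken from the field data.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up values standing in for one parameter column.
scaled = min_max_scale([200.0, 350.0, 500.0, 800.0])
print(scaled)
```

Scaling each parameter separately in this way puts all columns on a common axis, which is what makes a side-by-side box plot readable.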
Due to recording errors, magnitude errors, calculation errors, deduction errors, and other causes, the on-site data contain many abnormal values. The distribution of the data range is displayed in a histogram [1], as shown in Figure 3.
As can be seen from the graph, the distribution of on-site data is mostly uneven, with many outliers. The outliers have two sources: first, a large number of zeros in sparse data, which affects the overall data distribution; second, abnormal values caused by erroneous construction records, as shown in Figure 4. These two types of data need to be treated differently. Although sparse data are classified as outliers in the mathematical distribution, they do not affect model training, whereas outliers caused by recording errors can significantly affect the training process and the prediction accuracy of machine learning models. Outliers of the second type need to be corrected manually.
3.2. Data Preprocessing
3.2.1. Data Missing Situation
In addition to the issue of data anomalies, there are also serious missing data problems on-site. In traditional processing methods for big data, data fields with missing values exceeding 60% are usually deleted directly. However, due to the small amount of on-site data, these fields cannot be deleted. Therefore, special attention should be paid to data filling during data preprocessing.
As can be seen from Figure 5, only about 150 wells have complete data, while most wells lack ten to fifteen types of data, resulting in a high data loss rate.
As can be observed from Figure 6, the parameter data statistics are also incomplete, with 38.7% of parameters having a missing rate between 60% and 70%, and 3.2% of parameters having a missing rate exceeding 80%. These include important parameters such as porosity, permeability, and saturation.
3.2.2. Outlier Handling
The first step in outlier handling is to identify the outliers. Commonly used outlier identification methods include the IQR method, the MAD method, and the 3σ method.
The IQR method judges outliers from the lower quartile Q1 and upper quartile Q3 of the dataset. With the data sorted in ascending order, Q1 is the value at the 25% position and Q3 is the value at the 75% position; the difference Q3 − Q1 is the interquartile range (IQR). A value lying more than 1.5 × IQR below Q1 or above Q3 is considered an outlier. Box plots and histograms make the points beyond the whiskers, namely the outliers, directly visible, as shown in Figure 7. The green dashed line represents the lower boundary (Q1 − 1.5 × IQR), and any data below this value are considered outliers. The purple dashed line represents the upper boundary (Q3 + 1.5 × IQR), and any data above this value are also considered outliers. The light blue kernel density estimate (KDE) curve shows the smoothed probability density of the data, illustrating the central tendency and dispersion of the data values. This curve, combined with the histogram, aids in visually understanding the data distribution and highlighting the location of outliers. The curves in Figure 8 and Figure 9 have the same meaning as those in Figure 7. The lower plot in Figure 7 shows the box plot of the data, with points outside the two black lines identified as outliers. The combination of the histogram and box plot provides a better view of both the quantity and distribution of outliers.
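The IQR rule can be written in a few lines. The sketch below uses only the Python standard library; the function name, the quartile interpolation method ("inclusive"), and the sample values are assumptions made for illustration, not details reported in this study.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) under the IQR rule."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Made-up sample with one obviously abnormal record.
data = [10, 12, 13, 14, 15, 16, 18, 95]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)
```

Different quartile conventions shift the bounds slightly on small samples, so the exact cutoffs should be read as dependent on that choice.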
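The MAD method is built on the median absolute deviation (MAD), that is, the median of the absolute deviations of the data from their median. It assumes that the data follow a normal distribution, in which case the interval of one MAD on either side of the median covers the middle 50% of the values; values that lie far outside this central region, beyond a chosen multiple of the MAD, are judged to be outliers. The outlier situation can be viewed in the histogram, and the judgment boundary of the MAD is shown in Figure 8.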
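A common way to apply the MAD criterion is sketched below. The scaling constant 1.4826 (which makes the MAD comparable to the standard deviation for normally distributed data) and the cutoff of 3 are conventional choices assumed here; the cutoff actually used in this study is not stated. The function name and sample values are hypothetical.

```python
import statistics


def mad_outliers(values, threshold=3.0):
    """Flag values whose scaled distance from the median exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values are identical; the rule is undefined.
        return []
    scaled_mad = 1.4826 * mad  # assumed consistency constant for normal data
    return [v for v in values if abs(v - med) / scaled_mad > threshold]


# Made-up sample with one obviously abnormal record.
mad_flagged = mad_outliers([10, 12, 13, 14, 15, 16, 18, 95])
print(mad_flagged)
```

Because both the center and the spread are medians, a single extreme value has little influence on the boundary itself.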
The 3σ method assumes that the data follow a normal distribution centered on the mean, so that 99.7% of values fall within three standard deviations (σ) of the mean. A value more than 3σ away from the mean therefore occurs with a probability of only about 0.3%, which is a rare event, and can be identified as an outlier. The outlier situation can be viewed in the histogram, and the judgment boundary of 3σ is shown in Figure 9.
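The 3σ criterion can be sketched as follows, again with the standard library only. The function name and the sample values are hypothetical; the population standard deviation is used here as an assumption.

```python
import statistics


def three_sigma_outliers(values, k=3.0):
    """Flag values lying more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) > k * std]


# Made-up sample: 20 ordinary values plus one abnormal record.
sample = [12, 13, 14, 15, 16] * 4 + [95]
sigma_flagged = three_sigma_outliers(sample)
print(sigma_flagged)
```

One design caveat: the extreme value itself inflates both the mean and σ, so with very few records (fewer than about ten) a single outlier can never exceed the 3σ boundary.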
This study adopts the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which performs multidimensional clustering analysis based on Euclidean distance and automatically identifies outlying data points as noise. DBSCAN requires only two parameters: the search neighborhood radius ε and the minimum number of points MinPts required within that radius. During clustering, the neighbors of each data point within radius ε are counted, and points with at least MinPts neighbors are identified as core points. Points with fewer than MinPts neighbors that lie within the neighborhood of a core point are identified as border points. Core and border points always belong to a cluster, and points that do not belong to any cluster are identified as noise points. The core points and noise points are shown in Figure 10. In the figure, purple, yellow, and green dots represent core points, while black dots indicate noise points, which are treated as outliers.
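To make the roles of ε and MinPts concrete, the following is a minimal pure-Python sketch of the DBSCAN procedure. It is an illustration under stated assumptions, not the implementation used in this study: the neighbor count includes the point itself, the parameter values are arbitrary, and the two-dimensional points are made up.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one cluster label per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point: keep expanding
    return labels


# Two made-up dense groups and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 30)]
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels)
```

The isolated point has no neighbors within ε, so it is never absorbed into a cluster and keeps the noise label.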
Choosing an appropriate method is crucial for the accuracy of outlier detection. Compared with other commonly used outlier detection methods, such as the IQR, MAD, and 3σ methods, DBSCAN is more reasonable in several respects. Firstly, DBSCAN's density-based clustering can automatically identify clusters and noise points without making assumptions about the distribution of the data, whereas the IQR, MAD, and 3σ methods typically assume that the data follow a normal or relatively uniform distribution, which may not hold in practical applications. Secondly, DBSCAN can effectively handle data with nonlinear structures and is more adaptable. In addition, DBSCAN does not require a preset number of clusters and can automatically determine the number and shape of clusters, which is particularly important for datasets with unknown or complex features. Given these advantages and the characteristics of the data, DBSCAN was chosen for outlier detection in this study.
3.2.3. Data Filling
In data analysis and machine learning projects, handling missing values is one of the key steps in ensuring data quality and model performance. If missing values are not handled properly, this may lead to analysis bias and even affect the predictive ability of the model. Therefore, in this project, we adopted a comprehensive strategy that combines mean imputation and the K-Nearest Neighbor (KNN) algorithm to ensure that missing values in the data are properly and effectively filled.
Firstly, we conducted preliminary processing on the geographical features in the dataset. For the missing values in these feature columns, we filled them with column means. This method can reduce the information loss caused by missing values while preserving the overall trend of the data, providing a relatively complete data foundation for subsequent processing steps.
After completing the preliminary processing, we further used the K-Nearest Neighbor (KNN) algorithm to fill in the missing values in the data. KNN is an instance-based learning method that infers possible missing values by referencing several neighbors in the dataset that are most similar to the missing value samples. We set the number of reference neighbors to 50, which means that when filling in each missing value, the algorithm searches for the 50 most similar samples and infers the missing value based on the data of these neighbors. This method fully utilizes the similarity information in the data, providing more accurate and reasonable filling results, thereby effectively improving the integrity of the data.
After filling in the data using the above method, we updated the entire dataset to ensure that all missing values were properly processed. The situation before and after missing value processing is shown in Figure 11. In the figure, blue dots represent the original data distribution, while green dots show the data distribution after imputation.
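The two-step filling strategy (column means first, then KNN) can be sketched as below. This is a simplified illustration: the helper names, the toy table, the distance definition (Euclidean over features both rows have observed), and k = 2 are assumptions made for the example, whereas the study itself sets the number of neighbors to 50.

```python
import math


def mean_impute(rows, col):
    """Fill missing entries (None) in one column with the column mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [[mean if (c == col and v is None) else v
             for c, v in enumerate(r)] for r in rows]


def knn_impute(rows, k):
    """Fill remaining None entries with the mean of the k nearest rows."""
    def distance(a, b):
        # Euclidean distance over features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that have this column observed.
            donors = [(distance(row, other), other[col])
                      for j, other in enumerate(rows)
                      if j != i and other[col] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][col] = sum(nearest) / len(nearest)
    return filled


# Made-up table: column 0 is mean-filled first, column 1 is KNN-filled.
table = [[1.0, 10.0], [None, 11.0], [5.0, 50.0], [1.05, None]]
step1 = mean_impute(table, col=0)
filled = knn_impute(step1, k=2)
print(filled)
```

In practice the features would usually be normalized before computing distances, so that no single parameter dominates the neighbor search.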
3.3. Feature Extraction
Pearson coefficient analysis was conducted on the extracted geological and engineering data. This analysis involved determining the coefficient value for each pair of parameters. Following the analysis of all parameters, Figure 12 was generated. The colors in the heatmap correspond to the magnitude of the coefficient. Specifically, Pearson coefficients closer to 1 are represented by redder hues in the heatmap, while coefficients nearer to −1 appear bluer [35]. This visual representation provides a clear and intuitive understanding of the correlations present within the dataset.
When the absolute value of the Pearson coefficient is greater than 0.4, a linear correlation is generally considered to exist. The parameters with Pearson correlation coefficient absolute values greater than 0.4 were extracted, as shown in Figure 13, and a linear relationship test was conducted between each parameter and the 1800-day cumulative oil production. The results show that although there is a certain linear correlation between the construction parameters, geological parameters, and cumulative oil production, this correlation is not strong. This indicates that there may be nonlinear relationships or other complex influencing factors that have not been fully identified. To comprehensively understand and evaluate the impact of these parameters on cumulative oil production, this study introduces further analytical methods: gray relational analysis, the maximum mutual information method, AutoGluon feature importance, and SHAP value analysis, combined with the entropy weight method for comprehensive evaluation, are used to determine the key influencing factors. Together, these methods are expected to reveal more potential relationships and provide a more accurate basis for subsequent optimization decisions.
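The screening step based on the 0.4 threshold can be sketched as follows. The Pearson formula is standard; the feature names and all numbers are made up for illustration and do not come from the field dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical target (cumulative production) and two candidate features.
target = [100, 150, 210, 240, 300]
features = {
    "total_liquid_volume": [1, 2, 3, 4, 5],
    "perforations": [2, 5, 1, 4, 3],
}

# Keep only the features whose |r| with the target exceeds 0.4.
selected = [name for name, col in features.items()
            if abs(pearson(col, target)) > 0.4]
print(selected)
```

Because Pearson's coefficient only measures linear association, a feature dropped by this filter may still carry nonlinear information, which is the motivation for the additional methods described above.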
For the target variable of 1800-day cumulative oil production, this study combined gray relational analysis, the maximum mutual information method, AutoGluon feature importance assessment, and SHAP value analysis with the entropy weight method for a comprehensive evaluation. Through this series of analytical methods, the impact of each parameter on the target variable was identified and quantified, and the final evaluation results of the main controlling factors are shown in Figure 14.
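The entropy weight method assigns larger weights to criteria whose scores differ more across the candidates. The sketch below shows the standard calculation under stated assumptions: each row is a candidate parameter, each column is the score from one evaluation method, and all scores are non-negative and already on a comparable scale. The score matrix is made up.

```python
import math


def entropy_weights(matrix):
    """matrix[i][j]: non-negative score of item i under criterion j."""
    m, n = len(matrix), len(matrix[0])
    divergences = []
    for j in range(n):
        col = [row[j] for row in matrix]
        total = sum(col)
        probs = [v / total for v in col]
        # 0 * ln(0) is taken as 0 by convention.
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(m)
        divergences.append(1.0 - entropy)
    s = sum(divergences)
    return [d / s for d in divergences]


# Hypothetical scores of three parameters under two evaluation methods.
scores = [[0.9, 0.6],
          [0.1, 0.5],
          [0.5, 0.4]]
weights = entropy_weights(scores)

# Composite score per parameter, then a ranking from high to low.
composite = [sum(w * v for w, v in zip(weights, row)) for row in scores]
ranking = sorted(range(len(scores)), key=lambda i: composite[i], reverse=True)
print(weights, ranking)
```

Here the first criterion separates the candidates more strongly than the second, so it receives most of the weight.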
In the feature extraction stage, this article selects features from the raw data that are highly correlated with the target parameters. The aim is to reduce the dimensionality of the data and improve its ability to express nonlinear relationships, so that it is better suited to machine learning models and algorithms. To achieve this goal, this article primarily uses the principal component analysis (PCA) method for feature extraction. This method minimizes information loss by preserving the most representative information in the original data. After applying PCA, this paper selected 19 geological and engineering parameters from an initial set of 29 variables, using the entropy weight method to rank them by score from high to low. These parameters serve as input features for subsequent model training. The engineering parameters include the total gel liquid volume, liquid intensity, liquid-to-sand ratio, pre-pad liquid proportion, total sand volume, average sand ratio, sand intensity, total liquid volume, perforations, and net liquid volume. The geological parameters include minimum permeability (K_min), average pressure (P_avg), minimum pressure (P_min), maximum oil saturation (So_max), maximum water saturation (Sw_max), average oil saturation (So_avg), minimum water saturation (Sw_min), thickness, and average water saturation (Sw_avg). Among the selected parameters, total gel liquid volume, liquid intensity, and total liquid volume are positively correlated with the yield, while the other selected parameters are negatively correlated.
3.4. Model Training
In this article, the AutoGluon machine learning framework was utilized for rapid model training and optimization, achieving higher accuracy predictions by integrating multiple models without the need for manual hyperparameter search. Specifically, various ensemble learning techniques from AutoGluon, including stacking, k-fold cross-validated bagging, and multi-level stacking, were employed to enhance the model’s fitting and generalization performance.
Stacking trains multiple models independently on the same dataset and uses a linear model to calculate a weighted average of all model predictions. Bagging through k-fold cross-validation helps limit overfitting by performing k-fold cross-validation on all the models and averaging their outputs. Multi-layer stacking combines the original data with the results of single-layer stacking to form a new linear weighted model, further improving the prediction accuracy.
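The interplay of k-fold bagging and stacking can be illustrated with a small pure-Python sketch. This is not AutoGluon's implementation: the two base models (a least-squares line and a mean predictor), the interleaved fold assignment, the two-weight least-squares meta-learner, and the data are all assumptions chosen to keep the example short.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns a predictor function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_mean(xs, ys):
    """Baseline model that always predicts the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def out_of_fold(fit, xs, ys, k):
    """k-fold scheme: each sample is predicted by a model that never saw it."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), k):
            preds[i] = model(xs[i])
    return preds


def stack_weights(p1, p2, ys):
    """Meta-learner: least-squares weights for y ~= w1*p1 + w2*p2."""
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * y for a, y in zip(p1, ys))
    b2 = sum(b * y for b, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


# Made-up, nearly linear data.
xs = list(range(10))
ys = [1.0, 3.2, 4.9, 7.1, 9.0, 10.8, 13.1, 15.0, 16.9, 19.2]

oof_line = out_of_fold(fit_line, xs, ys, k=5)
oof_mean = out_of_fold(fit_mean, xs, ys, k=5)
w1, w2 = stack_weights(oof_line, oof_mean, ys)
print(w1, w2)
```

Fitting the meta-learner on out-of-fold predictions rather than in-sample predictions is what keeps it from simply rewarding the base model that memorizes the training data best.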
In the specific implementation, the model was trained for regression problems with the key parameters set to optimize the training process. By specifying the model save path, controlling the training time, and utilizing GPU-accelerated computing, the training efficiency of the model was significantly improved. Additionally, 5-fold cross-validation was employed to enhance the model’s generalization ability and ensure the effective application of stacking and bagging techniques through appropriate hierarchical control.
Through these methods and parameter settings, AutoGluon was able to automatically select and combine multiple models without human intervention, significantly improving the accuracy and efficiency of predictions.
In the data preprocessing stage, the DBSCAN algorithm was used to identify and remove outliers, and the mean and KNN methods were used to fill in missing values. Subsequently, the dataset was divided into a training set and a test set, with the test set accounting for 20% and not participating in model training. Based on the AutoGluon framework, bagging with 5-fold cross-validation and a single-layer stacking ensemble strategy were implemented. The RMSE was selected as the loss function, and the ability of the model to fit the actual data was evaluated using R². In Figure 15, the closeness between the predicted and actual values is measured by the distance of each point from the red dotted line: the smaller the distance, the better the prediction. The blue marks represent the prediction performance on the training set, the yellow marks represent the prediction performance on the test set, and points falling in the green area have a prediction accuracy above 85%. Although the R² value on the training set reached 0.79, indicating a good fit, the R² value on the test set was only 0.23, which indicates that the model overfits the training data.
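For reference, the evaluation metrics named in this study (MSE, RMSE, MAE, and R²) can be computed as in the sketch below. The function name and the toy values are hypothetical and unrelated to the reported results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}


# Toy values: three exact predictions and one that is off by 2.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(metrics)
```

Computing the same metrics separately on the training and test sets is what exposes a generalization gap such as the one reported above.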
4. Conclusions
This study aims to preliminarily construct a production capacity prediction model for a specific oilfield by employing regression analysis methods on a dataset containing 435 records and 19 features, with the goal of predicting the cumulative oil production over 1800 days. In the data preprocessing phase, we utilized a combination of graphical analysis and numerical processing methods. Specifically, we used the K-Nearest Neighbors (KNN) algorithm to impute missing values and employed the entropy weight method to conduct a comprehensive weighting of various key factors, extracting critical features to prepare for model training. Ultimately, we achieved automated model training using the AutoGluon framework, efficiently completing the prediction task.
During the model training process, we applied various algorithms, including Random Forest, CatBoost, Extra Trees, Neural Networks, XGBoost, NeuralNetTorch, and LightGBM. A comparative analysis revealed that the LightGBM model performed the best, with a root mean square error (RMSE) of 4175.08 on the training set and a coefficient of determination R² of 0.79, indicating that the model could explain 79% of the data variability. However, on the test set, the model's performance declined significantly, with the RMSE increasing to 11,113.33 and R² dropping to 0.23, revealing a clear overfitting issue. This result emphasizes the importance of model generalization capability. Future research will explore more advanced models and algorithms on this foundation to further improve prediction accuracy and reliability, providing more scientific guidance for the overall oil and gas development process.