URLearning
This page is now deprecated. Please visit urlearning.org for the most up-to-date information.
This page will no longer be maintained, so links may become broken when my cs.helsinki account expires.
This is software we have developed for learning a Bayesian network structure that optimizes a scoring function for a given dataset. The C++ source code is available as a NetBeans project. It requires Boost. This file describes the necessary steps to configure NetBeans to compile the code. This file describes the necessary steps to add a new configuration (algorithm) to the project.
You can also clone the developer version of the code with this command: hg clone https://bitbucket.org/bmmalone/urlearning-cpp
Furthermore, a set of scripts is available which implements most of the developments we published in UAI 2014 and AAAI 2014. They are available from the following repository: hg clone ssh://[email protected]/bmmalone/bnscripts. The readme.uai2014.txt file describes the prerequisites and the steps necessary to run the code.
There is also an older Java version available. It is not as efficient as the C++ version, but it includes more features and does not require any external dependencies.
Datasets
These packages contain all of the files necessary to reproduce the experiments in many of our papers.
Papers prior to UAI 2013
Published datasets (csv only) - The input files necessary to reproduce the experiments in most of our papers prior to UAI 2013, including the input datasets (csv folder). Most of these datasets are processed versions of datasets from the UCI machine learning repository. Continuous variables were binarized around their mean. Each value of a categorical variable was mapped arbitrarily to an integer (e.g., if a categorical variable had four categories, the categories would be mapped arbitrarily to {0, 1, 2, 3}); using these integer values, the categorical variables were then binarized around the mean. For most datasets, records with missing values were removed. This process sometimes results in variables with only a single observed value; these variables sometimes affect the scores in unexpected ways, especially fNML. I am working to remove these datasets.
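The preprocessing described above could be sketched roughly as follows. This is a hypothetical reconstruction, not the actual script used for the published datasets; details such as how values exactly equal to the mean are handled are assumptions.

```python
import pandas as pd

def preprocess_published(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the published-dataset preprocessing (assumed details):
    drop records with missing values, map each categorical value to an
    arbitrary integer, then binarize every column around its mean."""
    df = df.dropna()  # remove records with missing values
    out = {}
    for col in df.columns:
        values = df[col]
        if values.dtype == object:
            # categorical: map the k categories to the integers 0..k-1
            values = values.astype("category").cat.codes
        values = values.astype(float)
        # binarize around the mean (values equal to the mean map to 0
        # here; the original convention is not specified)
        out[col] = (values > values.mean()).astype(int)
    return pd.DataFrame(out)
```

For example, a continuous column [1.0, 2.0, 3.0] has mean 2.0 and would become [0, 0, 1].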
Scores were calculated for these datasets using a parent limit of 8; furthermore, a time limit of no more than 10 minutes of actual score calculation time was imposed on each variable (note that post-processing pruning and writing to disk are not included in this limit). The scores were calculated on a node which has XXX. This spreadsheet (Published Datasets sheet) gives runtime information for using the score program in the C++ version of URLearning to calculate, prune, and write the scores to disk. Due to technical problems with the automation, some of the running times may not be exactly correct; I am working to correct these. The following scores are available:
Unpublished datasets - The preprocessing scheme described for the published datasets may not reflect real-world usage scenarios, so I recalculated the scores using a more realistic preprocessing scheme. First, records with missing values were removed. Next, continuous variables were discretized according to the NML-optimal histogram of Kontkanen and Myllymaki. Then, categorical variables with more than 10 values were typically removed (a note is included if some other step was taken to handle categorical variables with large cardinality). Finally, variables with a single observed value were removed from the dataset.
This preprocessing scheme results in more large-cardinality variables than the simple scheme used for the published results. Consequently, it can result in more parent set pruning for scoring functions with a heavy complexity penalty, such as BIC. On the other hand, larger parent sets can be more informative, which can reduce the amount of parent set pruning. An interesting avenue for future work is to consider how this preprocessing affects learning.
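The four-step revised scheme could be sketched as below. Note that this is an illustrative approximation: simple equal-width binning stands in for the NML-optimal histogram of Kontkanen and Myllymaki (which chooses bin count and boundaries by a normalized maximum likelihood criterion), and the parameter names are hypothetical.

```python
import pandas as pd

def preprocess_unpublished(df: pd.DataFrame, max_card: int = 10,
                           n_bins: int = 4) -> pd.DataFrame:
    """Sketch of the revised preprocessing (assumed details)."""
    df = df.dropna()  # 1. remove records with missing values
    out = {}
    for col in df.columns:
        values = df[col]
        if values.dtype == object:
            # 3. drop categorical variables with more than max_card values
            if values.nunique() > max_card:
                continue
            values = values.astype("category").cat.codes
        else:
            # 2. discretize continuous variables; equal-width binning is a
            # placeholder for the NML-optimal histogram
            values = pd.cut(values, bins=n_bins, labels=False)
        # 4. drop variables with only a single observed value
        if values.nunique() > 1:
            out[col] = values.astype(int)
    return pd.DataFrame(out)
```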
Experiment set 2 - All of the files to reproduce the experiments in Section 5.4 of the paper, including the synthetic networks (Net folder), sampled datasets (Csv folder) and parent set scores (pss folder). This spreadsheet (UAI 2013 sheet) gives runtime information for using the score program in the C++ version of URLearning to calculate the necessary parent set scores from the sampled datasets on nodes which have 32GB of RAM and 2 Intel Xeon E5540 2.53GHz CPUs. Each CPU has 4 cores, so one node has a total of 8 cores. The cores use hyperthreading (so 16 hardware threads total, I think). 10 threads were used during the score calculations.
Published datasets (csv, pss, binary scores) - All of the files to reproduce the experiments in our AAAI 2014 and UAI 2014 papers.
File Formats
Unless otherwise noted, the files use the following formats.
Hugin net (*.net) - for representing Bayesian networks
Comma separated value (*.csv) - for datasets. Many of the datasets include a header row that gives the names of the variables, but some do not.
Parent set scores (*.pss) - for giving the parent set scores which most exact structure learning programs assume as input. Mark Bartlett and I are currently formalizing this format; some notes are available in this spreadsheet. Mark has kindly extended the GOBNILP program to use this format by recompiling GOBNILP using this version of the probdata_bn.c file. GOBNILP must then be run using the -f=pss command line argument.
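Since the format is still being formalized, the layout below is only an assumption, based on the GOBNILP-style score file convention: a first line giving the number of variables, then for each variable a header line "NAME COUNT" followed by COUNT lines of "SCORE K PARENT_1 ... PARENT_K". All variable names and scores in the sample are hypothetical. A minimal parser sketch:

```python
# Hypothetical sample in the assumed GOBNILP-style score layout.
SAMPLE = """\
3
A 2
-10.2 0
-8.1 1 B
B 1
-5.0 0
C 2
-7.3 0
-6.9 2 A B
"""

def parse_pss(text: str) -> dict:
    """Parse parent set scores, assuming the layout described above.
    Returns {variable: {frozenset(parents): score}}."""
    lines = [ln.split() for ln in text.split("\n") if ln.strip()]
    n_vars = int(lines[0][0])
    pos = 1
    scores = {}
    for _ in range(n_vars):
        name, count = lines[pos][0], int(lines[pos][1])
        pos += 1
        entries = {}
        for _ in range(count):
            fields = lines[pos]
            score, k = float(fields[0]), int(fields[1])
            entries[frozenset(fields[2:2 + k])] = score
            pos += 1
        scores[name] = entries
    return scores
```

Under this assumed layout, the local score of C with parent set {A, B} in the sample would be -6.9. Consult the spreadsheet linked above for the authoritative notes on the format.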