Training, Testing, and Input Data
The full code for this and all other user guides can be found in our user guide tutorial.
The first step to getting started with Pyreal is to prepare your data.
Pyreal expects data as pandas DataFrames. Each row represents one data instance (a person, place, thing, or other entity), and each column represents a feature, or piece of information about that instance. Column headers are the feature names. Each instance may optionally have an instance ID, which can be stored either as the DataFrame's index (row IDs) or as a separate column.
For example, a part of your data may look like:
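A minimal sketch of such a DataFrame, assuming a housing example. The column names and values here are hypothetical and are not taken from the actual sample dataset; the sketch only illustrates the row/column layout and the two ways of storing an instance ID.

```python
import pandas as pd

# Hypothetical values for illustration only; not real dataset rows.
data = pd.DataFrame(
    {
        "House ID": [101, 102, 103],
        "Lot size (sq ft)": [8450, 9600, 11250],
        "Year built": [2003, 1976, 2001],
        "Neighborhood": ["A", "B", "A"],
    }
)

# Option 1: keep the instance ID as a separate column (as in `data`).
# Option 2: store the instance ID as the DataFrame's index (row IDs).
data_indexed = data.set_index("House ID")

print(data_indexed)
```

Either layout keeps one row per instance and one column per feature; the choice only affects where the ID lives.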
There are three categories of data relevant to ML decision-making: training data, testing data, and input data.
The training data is used to train the ML model and explainers. The testing data is used to evaluate the performance of the ML model (i.e., how accurately it makes predictions). The input data is the data that you actively wish to get predictions on and understand better.
For training and testing data, we usually have the ground-truth values (the "correct" answer for the value your model tries to predict, often referred to as y-values) for all rows of data.
For example, if you are trying to predict house prices, your training and testing datasets would also include the price of each house. Pyreal expects these target values as pandas Series.
For the input data, we do not know the ground truth, so we use the ML model to get a prediction.
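The distinction between labelled data and input data can be sketched as follows. This is an illustration under assumed names: the columns, values, and variable names (`X`, `y`, `X_input`) are hypothetical and not part of Pyreal.

```python
import pandas as pd

# Hypothetical labelled data (training/testing): features plus a known price.
labelled = pd.DataFrame(
    {
        "Lot size (sq ft)": [8450, 9600, 11250, 9550],
        "Year built": [2003, 1976, 2001, 1915],
        "Sale price": [208500, 181500, 223500, 140000],
    }
)

# Features stay in a DataFrame; the ground-truth target becomes a Series.
X = labelled.drop(columns="Sale price")
y = labelled["Sale price"]

# Input data has the same feature columns but no known target value.
X_input = pd.DataFrame({"Lot size (sq ft)": [10000], "Year built": [1999]})
```

Keeping the target out of the feature DataFrame means the training, testing, and input data all share the same columns.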
Sample Code
For this user guide, our examples will use a smaller version of the Ames Housing Dataset, with just 8 key features. You can load in sample data using the Pyreal sample_applications module, and use the train_test_split function from sklearn to split your data into training and testing sets.
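A minimal sketch of the splitting step. To stay runnable without Pyreal installed, it builds a small stand-in dataset with made-up values instead of loading the sample data; the feature names, the 80/20 split, and the fixed `random_state` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the sample data (made-up values). With Pyreal installed, the
# features and targets would come from the sample_applications module instead.
X = pd.DataFrame(
    {
        "Lot size (sq ft)": [8000 + 250 * i for i in range(10)],
        "Year built": [1950 + 5 * i for i in range(10)],
    }
)
y = pd.Series([150000 + 10000 * i for i in range(10)], name="Sale price")

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), "training rows,", len(X_test), "testing rows")
```

Passing the features and targets together keeps each row of `X_train` and `X_test` aligned with its target value in `y_train` and `y_test`.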