FAQ

How can I change my username and password?

Your username and password are tied to the experiments you have created. For example, if you log in with the username/password megan/megan and start an experiment, then you need to log back in with those same credentials to see that experiment. The username and password do not, however, limit your access to Driverless AI: you can log in again with a new username and password, but keep in mind that you won't see your old experiments.

How can I upgrade to a newer version of Driverless AI?

Driverless AI provides the following set of commands that you can run on the Linux machine that is running Driverless AI. Note that these commands work only on Linux; they are not available for Mac and Windows installations.

h2oai stop     (Stop all instances)
h2oai start    (Start an instance)
h2oai restart  (Restart instance)
h2oai clean    (Removes old containers)
h2oai purge    (Removes old containers and purges tmp and log)
h2oai upgrade  (Upgrades image to Developer latest *unstable*)
h2oai update   (Upgrades image to latest Release)
h2oai ssh      (Attaches to running docker)
h2oai log      (Tails the running server log)
h2oai jupyter  (Fetch the Jupyter URL with token)

To upgrade to the latest version, stop Driverless AI, then run the h2oai update command.

What kind of authentication is supported in Driverless AI?

Driverless AI supports LDAP and PAM authentication. This can be configured by setting the appropriate options in the config.toml file or by specifying the corresponding environment variables when starting Driverless AI. (Refer to Data Connectors and The Config.toml File for more information.) Examples for enabling PAM and LDAP authentication will be documented in the next release.

# Authentication
#  unvalidated : Accepts user id and password, does not validate the password
#  none : Does not ask for user id or password; authenticated as admin
#  pam :  Accepts user id and password, validates the user against the operating system
#  ldap : Accepts user id and password, validates against an LDAP server; look for additional settings under LDAP settings
authentication_method = "unvalidated"

# LDAP settings
ldap_server = ""
ldap_port = ""
ldap_dc = ""

# Configuration for an HDFS data source
# Path of the HDFS core-site.xml
core_site_xml_path = ""
# Path of the Driverless AI principal keytab file
key_tab_path = ""

# HDFS connector
# Auth type can be Principal, Keytab, or keytabPrincipal
# Specify the HDFS auth type; allowed options are described below
#   Principal : Authenticate with HDFS as a principal user
#   Keytab : Authenticate with a keytab, preferably one created for the Driverless AI application
#   Impersonate with Keytab : Log in with impersonation using a Driverless AI keytab
hdfs_auth_type = "keytab"
# Creating a user in Kerberos for the Driverless AI application is recommended;
# specify the Kerberos app principal user below
hdfs_app_principal_user = ""
# Specify the user id of the current user here as user@realm
hdfs_app_login_user = ""
#
hdfs_app_jvm_args = ""

# S3 connector credentials
aws_access_key_id = ""
aws_secret_access_key = ""

Can I set up SSL on Driverless AI?

Yes, you can set up HTTPS/SSL on Driverless AI running in an AWS environment. HTTPS/SSL needs to be configured on the host machine, and the necessary ports must be opened on the AWS side. You will need your own SSL certificate, or you can create a self-signed certificate.

The following is a simple example showing how to configure HTTPS with a proxy pass to port 12345 on the container, with the keys placed in /etc/nginx/. Replace <server_name> with your server name.

server {
    listen 80;
    return 301 https://$host$request_uri;
}

server {
    listen 443;

    # Specify your server name here
    server_name <server_name>;

    ssl_certificate           /etc/nginx/cert.crt;
    ssl_certificate_key       /etc/nginx/cert.key;
    ssl on;
    ssl_session_cache  builtin:1000  shared:SSL:10m;
    ssl_protocols  TLSv1 TLSv1.1 TLSv1.2;
    ssl_ciphers HIGH:!aNULL:!eNULL:!EXPORT:!CAMELLIA:!DES:!MD5:!PSK:!RC4;
    ssl_prefer_server_ciphers on;

    access_log            /var/log/nginx/dai.access.log;

    location / {
      proxy_set_header        Host $host;
      proxy_set_header        X-Real-IP $remote_addr;
      proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header        X-Forwarded-Proto $scheme;

      # Fix the "It appears that your reverse proxy set up is broken" error.
      proxy_pass          http://localhost:12345;
      proxy_read_timeout  90;

      # Specify your server name for the redirect
      proxy_redirect      http://localhost:12345 https://<server_name>;
    }
}

More information about SSL for Nginx in Ubuntu 16.04 can be found here: https://www.digitalocean.com/community/tutorials/how-to-create-a-self-signed-ssl-certificate-for-nginx-in-ubuntu-16-04.

Is there a file size limit for datasets?

The file size for datasets is limited by GPU memory, but we continue to make optimizations for getting more data into an experiment.

How does Driverless AI detect the ID column?

Driverless AI detects the ID column by name only: the column must be named 'id', 'Id', 'ID', or 'iD' exactly. (It does not check the number of unique values.) For now, if you want to ensure that your ID column is downloaded with the predictions, name it one of those names.
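That naming rule can be sketched as follows. This is a simplified illustration of the rule described above, not Driverless AI's actual implementation; the function name is hypothetical.

```python
# Hypothetical sketch of the ID-column rule: a column is treated as
# the ID column only if its name is exactly 'id', 'Id', 'ID', or 'iD'.
ID_NAMES = {"id", "Id", "ID", "iD"}

def find_id_column(columns):
    """Return the first exactly-matching column name, else None."""
    for name in columns:
        if name in ID_NAMES:  # exact match; no uniqueness check on values
            return name
    return None
```

Note that near-misses such as "identifier" or "idx" are not matched, since the comparison is an exact string match rather than a prefix or case-insensitive check.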

Can Driverless AI handle data with missing values/nulls?

Yes, data that is imported into Driverless AI can include missing values. Feature engineering is fully aware of missing values, and missing values are treated as information - either as a special categorical level or as a special number. So for target encoding, for example, rows with a certain missing feature will belong to the same group; for clustering, we impute missing values; for frequency encoding, we count the number of rows that have a certain missing feature.
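The frequency-encoding case can be illustrated with a small sketch that treats missing values (None here) as their own level. This is a simplified stand-in written from scratch, not Driverless AI's internal code.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each value, including the missing marker (None),
    with the number of rows carrying that value."""
    counts = Counter(values)          # None is counted as its own level
    return [counts[v] for v in values]

col = ["a", None, "b", None, "a", None]
print(frequency_encode(col))  # → [2, 3, 1, 3, 2, 3]
```

All three rows with a missing value receive the same count (3), mirroring the idea that rows with a certain missing feature belong to the same group.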

If I drop several columns from the Train dataset, will Driverless AI understand that it needs to drop the same columns from the Test dataset?

If you drop columns from the training dataset, Driverless AI automatically drops the same columns from the test dataset.

Which algorithms are used in Driverless AI?

Currently we only use XGBoost GPU for building models and testing the engineered features. Additional GPU algorithms will be added at a later date.

Does Driverless AI perform internal or external validation?

Driverless AI does internal validation when only training data is provided. It does external validation when training and validation data are provided. In either scenario, the validation data is used for all parameter tuning (models and features), not just for feature selection. Parameter tuning includes target transformation, model selection, feature engineering, feature selection, stacking, etc.

Specifically:

  • Internal validation (only training data given):
    • Ideal when data is close to iid
    • Internal holdouts are used for parameter tuning
    • Will do the full spectrum from single holdout split to 5-fold CV, depending on accuracy settings
    • No need to split training data manually
    • Final models are trained using CV on the training data
  • External validation (training + validation data given):
    • Ideal when there’s drift in the data
    • No training data wasted during training since training data not used for parameter tuning
    • Entire validation set used for parameter tuning
    • No CV possible, since we explicitly do not want to overfit on the training data

Tip: If you want both training and validation data to be used for parameter tuning (the training process), just concatenate the datasets together and turn them both into training data for the “internal validation” method.
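The concatenation tip above can be sketched with pandas (assumed available here; the column names and values are hypothetical). The combined frame would then be uploaded as the training dataset so that internal validation tunes on both parts.

```python
import pandas as pd

# Hypothetical train and validation frames with the same schema.
train = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})
valid = pd.DataFrame({"x": [4, 5], "y": [1, 0]})

# Concatenate and re-index; use the result as the single training
# dataset for the "internal validation" method.
combined = pd.concat([train, valid], ignore_index=True)
print(len(combined))  # → 5
```

`ignore_index=True` gives the combined frame a clean 0..n-1 index instead of repeating the original row labels.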

How does Driverless AI prevent overfitting?

Driverless AI performs a number of checks to prevent overfitting. For example, during certain transformations, Driverless AI calculates the average on out-of-fold data using cross validation. Driverless AI also performs early stopping, ensuring that the model build stops when it ceases to improve. Additional steps to prevent overfitting include checking for i.i.d. data and avoiding leakage during feature engineering.
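The out-of-fold averaging mentioned above can be illustrated with a small sketch (a simplified stand-in for Driverless AI's internal target encoding, written from scratch; the round-robin fold assignment is an assumption for illustration).

```python
def oof_mean_encode(categories, targets, n_folds=2):
    """Encode each row as the mean target of its category computed
    only from the *other* folds, so a row never sees its own target."""
    n = len(categories)
    folds = [i % n_folds for i in range(n)]   # simple round-robin fold assignment
    global_mean = sum(targets) / n            # fallback when no out-of-fold rows exist
    encoded = []
    for i in range(n):
        vals = [targets[j] for j in range(n)
                if folds[j] != folds[i] and categories[j] == categories[i]]
        encoded.append(sum(vals) / len(vals) if vals else global_mean)
    return encoded
```

Because each row's encoding is computed without its own target value, the encoded feature cannot simply memorize the label, which is the overfitting protection being described.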

A blog post describing Driverless AI overfitting protection in greater detail is currently in development.

What can I do if my training and validation data are not identically distributed?

If you feel that your training and validation data are not identically distributed, you can optionally provide an observation weights column (such as exponential weighting in time, or different weights for validation vs. training rows). All algorithms and metrics in Driverless AI support observation weights.

How does Driverless AI handle fold assignments for weighted data?

Currently, Driverless AI does not take the weights into account during fold creation, but you can provide a fold column to enforce your own grouping, i.e., to keep rows that belong to the same group together (either in train or validation). The fold column must be a categorical column (integers are OK) that assigns a group ID to each row. (It needs to have at least 5 groups, since Driverless AI does up to 5-fold CV.)
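A fold column of the kind described can be sketched as below; the group IDs and row values are hypothetical, and the loop is only a schematic of how grouped CV splits keep each group intact.

```python
# Rows that share a fold-column value stay together: for each fold,
# the holdout is all rows carrying one group ID.
rows = [
    {"x": 1, "fold": "g1"}, {"x": 2, "fold": "g1"},
    {"x": 3, "fold": "g2"}, {"x": 4, "fold": "g3"},
    {"x": 5, "fold": "g4"}, {"x": 6, "fold": "g5"},
]

group_ids = sorted({r["fold"] for r in rows})   # at least 5 groups required
for gid in group_ids:
    holdout = [r for r in rows if r["fold"] == gid]
    train   = [r for r in rows if r["fold"] != gid]
    # ...train on `train`, validate on `holdout`...
```

The two "g1" rows always land in the same partition, which is the guarantee the fold column provides.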

Where can I get details of the various transformations performed in an experiment?

Inside the /tmp folder, you will see a folder with your experiment ID, and within that folder is a *logs_<experiment>.zip file. This zip file includes summary information, log information, and a gene_summary.txt file with details of the transformations used in the experiment.

How can I download the predictions onto the machine where Driverless AI is running?

When you select Score on Another Dataset, the predictions are automatically saved on the machine where Driverless AI is running, in the following locations:

  • Training Data Predictions: tmp/experiment_name/train_preds.csv
  • Testing Data Predictions: tmp/experiment_name/test_preds.csv
  • New Data Predictions: tmp/experiment_name/automatically_generated_name. Note that the automatically generated name will match the name of the file downloaded to your local computer.