Example Datasets
Yellowbrick hosts several datasets wrangled from the UCI Machine Learning Repository to present the examples in this section. If you haven’t downloaded the data, you can do so by running:
$ python -m yellowbrick.download
This should create a folder called data in your current working directory that contains all of the datasets. You can load a specific dataset with pandas.read_csv as follows:
import pandas as pd
data = pd.read_csv('data/concrete/concrete.csv')
The following code snippet can be found at the top of the examples/examples.ipynb notebook in Yellowbrick. Refer to this code when loading a specific dataset:
import os

import pandas as pd

from yellowbrick.download import download_all

## The path to the test data sets
FIXTURES = os.path.join(os.getcwd(), "data")

## Dataset loading mechanisms
datasets = {
    "bikeshare": os.path.join(FIXTURES, "bikeshare", "bikeshare.csv"),
    "concrete": os.path.join(FIXTURES, "concrete", "concrete.csv"),
    "credit": os.path.join(FIXTURES, "credit", "credit.csv"),
    "energy": os.path.join(FIXTURES, "energy", "energy.csv"),
    "game": os.path.join(FIXTURES, "game", "game.csv"),
    "mushroom": os.path.join(FIXTURES, "mushroom", "mushroom.csv"),
    "occupancy": os.path.join(FIXTURES, "occupancy", "occupancy.csv"),
}

def load_data(name, download=True):
    """
    Loads and wrangles the passed in dataset by name.
    If download is specified, this method will download any missing files.
    """
    # Get the path from the datasets
    path = datasets[name]

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        if download:
            download_all()
        else:
            raise ValueError((
                "'{}' dataset has not been downloaded, "
                "use the download.py module to fetch datasets"
            ).format(name))

    # Return the data frame
    return pd.read_csv(path)
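Before loading, it can help to check which datasets are already on disk. The following is a minimal sketch that assumes the data/<name>/<name>.csv layout used by the paths above. The helper name available_datasets and the DATASET_NAMES list are hypothetical and are not part of Yellowbrick.

```python
import os

# Names copied from the datasets mapping above. The layout
# <root>/<name>/<name>.csv is assumed from the paths in that mapping.
DATASET_NAMES = [
    "bikeshare", "concrete", "credit", "energy",
    "game", "mushroom", "occupancy",
]


def available_datasets(root="data"):
    """Return the sorted dataset names whose CSV file exists under root.

    Hypothetical helper for illustration; not part of the Yellowbrick API.
    """
    found = []
    for name in DATASET_NAMES:
        path = os.path.join(root, name, name + ".csv")
        if os.path.isfile(path):
            found.append(name)
    return sorted(found)
```

Calling available_datasets() with no arguments looks in the data folder of the current working directory, which is where the download command places the files. An empty list suggests the data has not been downloaded yet.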
Unless specifically shown otherwise, most of the examples use one or more of the listed datasets. Each dataset has a README.md with detailed information about the data source, attributes, and target. Here is a complete listing of all datasets in Yellowbrick and their associated analytical tasks:
- bikeshare: suitable for regression
- concrete: suitable for regression
- credit: suitable for classification/clustering
- energy: suitable for regression
- game: suitable for classification
- hobbies: suitable for text analysis
- mushroom: suitable for classification/clustering
- occupancy: suitable for classification
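The listing above can also be written as a small lookup table, which is handy when picking a dataset for a given kind of example. This is only a sketch: the DATASET_TASKS dictionary and the datasets_for helper are illustrative names, not something Yellowbrick provides, and the task labels are copied directly from the list.

```python
# Task labels copied from the listing above. The dictionary and helper
# are illustrative only and are not part of the Yellowbrick API.
DATASET_TASKS = {
    "bikeshare": {"regression"},
    "concrete": {"regression"},
    "credit": {"classification", "clustering"},
    "energy": {"regression"},
    "game": {"classification"},
    "hobbies": {"text analysis"},
    "mushroom": {"classification", "clustering"},
    "occupancy": {"classification"},
}


def datasets_for(task):
    """Return the sorted dataset names listed as suitable for the task."""
    return sorted(name for name, tasks in DATASET_TASKS.items() if task in tasks)


print(datasets_for("regression"))  # ['bikeshare', 'concrete', 'energy']
```

Using sets for the task labels keeps datasets with more than one suitable task, such as credit and mushroom, in a single entry.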