Tutorial: Create, track, and use a dataset artifact
Create, track, and use a dataset artifact with W&B.
This walkthrough demonstrates how to create, track, and use a dataset artifact.
1. Log in to W&B#
Import the W&B library and log in to W&B. You will need to sign up for a free W&B account if you have not done so already.
import wandb
wandb.login()
2. Initialize a run#
Use wandb.init() to initialize a run. This starts a background process that syncs and logs data. Provide a project name and a job type:
# Create a W&B run. Here we specify 'upload-dataset' as the job type since this
# example shows how to create a dataset artifact.
with wandb.init(project="artifacts-example", job_type="upload-dataset") as run:
    pass  # Your code here
3. Create an artifact object#
Create an artifact object with wandb.Artifact(). Pass a name for the artifact to the name parameter and a description of the file type to the type parameter.
For example, the following code snippet creates an artifact called ‘bicycle-dataset’ with the type ‘dataset’:
artifact = wandb.Artifact(name="bicycle-dataset", type="dataset")
For more information about how to construct an artifact, see Construct artifacts.
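Artifact names are often derived from a human-readable dataset title. The sketch below normalizes such a title into a conservative name before it is passed to `wandb.Artifact()`. It assumes that names made only of letters, digits, dashes, underscores, and dots are accepted; that rule and the helper `to_artifact_name` are assumptions for illustration, not part of the W&B API.

```python
import re


def to_artifact_name(title: str) -> str:
    """Turn a free-form title into a conservative artifact name (hypothetical helper)."""
    # Assumption: letters, digits, '-', '_' and '.' are safe in artifact names.
    # Every run of other characters is collapsed into a single dash.
    name = re.sub(r"[^A-Za-z0-9_.-]+", "-", title.strip().lower())
    name = name.strip("-")
    if not name:
        raise ValueError(f"cannot derive an artifact name from {title!r}")
    return name
```

With this helper, `to_artifact_name("Bicycle Dataset")` yields `bicycle-dataset`, the name used throughout this walkthrough.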
4. Add the dataset to the artifact#
Add a file to the artifact. Common file types include models and datasets. The following example adds a dataset named dataset.h5 that is saved locally on our machine to the artifact:
# Add a file to the artifact's contents
artifact.add_file(local_path="dataset.h5")
Replace the filename dataset.h5 in the previous code snippet with the path to the file you want to add to the artifact.
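Because `add_file()` reads from the local filesystem, a mistyped path is easiest to catch before the artifact is built. The following sketch checks the path with the standard library first; `require_file` is a hypothetical helper name, not a W&B function.

```python
from pathlib import Path


def require_file(local_path: str) -> str:
    """Return the absolute path of an existing file, or raise early (hypothetical helper)."""
    path = Path(local_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"dataset file not found: {path}")
    return str(path)
```

The result can then be handed to the artifact, for example `artifact.add_file(local_path=require_file("dataset.h5"))`.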
5. Log the dataset#
Use the W&B run object's wandb.Run.log_artifact() method to both save your artifact version and declare the artifact as an output of the run.
# Save the artifact version to W&B and mark it
# as the output of this run
run.log_artifact(artifact)
A 'latest' alias is created by default when you log an artifact. For more information about artifact aliases and versions, see Create a custom alias and Create new artifact versions, respectively.
Putting this together, your script so far should look like this:
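`log_artifact()` also accepts an `aliases` list for custom aliases. A minimal sketch of passing extra aliases follows; the wrapper `log_with_aliases` and any alias values are hypothetical, and `run` and `artifact` are assumed to be the objects created in the earlier steps.

```python
from typing import Iterable, List


def log_with_aliases(run, artifact, aliases: Iterable[str]) -> List[str]:
    """Log `artifact` on `run` with de-duplicated aliases (hypothetical wrapper)."""
    cleaned: List[str] = []
    for alias in aliases:
        alias = alias.strip()
        # Skip blanks and repeats while keeping the caller's order.
        if alias and alias not in cleaned:
            cleaned.append(alias)
    # `aliases` is a keyword argument of wandb.Run.log_artifact().
    run.log_artifact(artifact, aliases=cleaned)
    return cleaned
```

Inside the `with wandb.init(...) as run:` block, this could be called as `log_with_aliases(run, artifact, ["raw"])` in place of the plain `run.log_artifact(artifact)` call.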
import wandb
wandb.login()
with wandb.init(project="artifacts-example", job_type="upload-dataset") as run:
    artifact = wandb.Artifact(name="bicycle-dataset", type="dataset")
    artifact.add_file(local_path="dataset.h5")
    run.log_artifact(artifact)
6. Download and use the artifact#
The following code example demonstrates the steps you can take to use an artifact you have logged and saved to the W&B servers.
- First, initialize a new run object with wandb.init().
- Second, use the run object's wandb.Run.use_artifact() method to tell W&B what artifact to use. This returns an artifact object.
- Third, use the artifact's wandb.Artifact.download() method to download the contents of the artifact.
# Create a W&B run. Here we specify 'training' for 'job_type'
# because we will use this run to track training.
with wandb.init(project="artifacts-example", job_type="training") as run:
    # Query W&B for an artifact and mark it as input to this run
    artifact = run.use_artifact("bicycle-dataset:latest")
    # Download the artifact's contents
    artifact_dir = artifact.download()
Alternatively, you can use the Public API (wandb.Api) to export (or update) data already saved in W&B outside of a run. See Track external files for more information.
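`artifact.download()` returns the path of the local directory that holds the artifact's files. A minimal sketch for inspecting that directory with the standard library follows; `list_downloaded_files` is a hypothetical helper, and `artifact_dir` is assumed to be the value returned in the step 6 example.

```python
from pathlib import Path
from typing import List


def list_downloaded_files(artifact_dir: str) -> List[str]:
    """Return the files under `artifact_dir` as sorted, POSIX-style relative paths."""
    root = Path(artifact_dir)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
```

For the artifact logged in this walkthrough, `list_downloaded_files(artifact_dir)` would be expected to include `dataset.h5`.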