Examples Using HloEnv

This documentation only covers the Python interface. We will walk through two simple examples that make use of several HloEnv features. We first show how to read in an HLO text file and turn it into HLO graph features that can be used to implement a user-defined decision-making agent. We then present a very simple decision-making agent whose policy randomly chooses from the available actions.

Playing with HLO Graph Features

First make sure your current working directory is correct.

$ cd /path/to/hloenv/examples

The HloEnv module holds most functionality, so we usually import it first.

import os
import pathlib
import numpy as np
from hloenv import HloEnv

Pick an HLO text file that we want to take a closer look at.

hlo_path = os.path.join(
  pathlib.Path(__file__).parent.absolute(),
  "hlo_texts/jax-md/module_0013.jit__lambda_.7.before_optimizations.txt"
)

Now we are ready to create a basic HloEnv object with the GPU backend. We have not yet worked on other ML hardware, so GPU is currently the only backend HloEnv supports.

hlo_env = HloEnv(hlo_path, "gpu")

The HloEnv can automatically extract features from HLO text files and organize them into another class, HloGraph.

hlo_graph = hlo_env.get_hlo_graph()

To make the features more array-programming friendly, all the graph features in HloGraph are organized in compressed sparse row (CSR) form. There are three types of features: global graph features, node features, and in/out edge features, all of which are accessible as members of the HloGraph object. Details are given in the following tables:

Global Graph Features
Feature Name Description
out_edge_offsets The offset index to the actual out edge node ID indices array
out_edge_indices The out edge node ID indices array
in_edge_offsets The offset index to the actual in edge node ID indices array
in_edge_indices The in edge node ID indices array
alternative_indices The indices to all the kAlternative nodes
opcode_attr_counts Number of attributes in HLO opcode

All edge features are vectors whose length equals the number of edges in the HLO graph. In and out edge features share the same feature set, as follows.

In and Out Edge Features
Feature Name Description
uids Unique ID of the edge, a concatenation of the source and destination node uids
srcs Node index of source node
dsts Node index of destination node
dims Dimensions of the tensor that flows along this edge
layout Layout of the tensor that flows along this edge
lehmercodes The Lehmer code (a better embedding) of the tensor layout
types Edge type, one of: outside any fusion, inside a fused computation, or crossing a fusion boundary
dtypes Data type of the tensor that flows along this edge

All node features are vectors whose length equals the number of HloInstructions (nodes) in the HloModule (HloGraph).

Node Features
Feature Name Description
uids Unique ID of an HloInstruction
gids Sub-computation ID the HloInstruction belongs to; 0 means the main computation
normalized_num_group_inst If an HloInstruction is inside a sub-computation, normalized_num_group_inst is the reciprocal of the total number of instructions in that sub-computation. This can serve as a weighting parameter for an instruction's impact
num_users Number of HloInstructions that use the result of this HloInstruction
num_operands Number of HloInstructions whose results this HloInstruction uses
opcodes HLO opcode index, as defined here
opcode_attrs Unique attribute embeddings for each opcode
num_opcode_attrs List of pairs; each pair contains the number of integer attributes and the number of enum attributes in opcode_attrs
is_alternative List of booleans that show whether each HloInstruction is a kAlternative node
is_in_fusion List of booleans that show whether each HloInstruction is inside a fused computation
in_tensor_sizes The total input tensor size from all operands of this HloInstruction
out_tensor_sizes The output tensor size of this HloInstruction
has_max_in_tensor List of booleans that show whether one of the operands has the maximum input tensor size
has_max_out_tensor List of booleans that show whether the output tensor size is the maximum
names List of strings giving the name of each HloInstruction

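As a quick sanity check, the features can be inspected directly. Below is a minimal sketch; the attribute names follow the tables above, though the exact access paths (for example, whether the global features hang directly off hlo_graph) may differ slightly between builds.

import numpy as np

# Node features: one entry per HloInstruction in the module.
node_features = hlo_graph.node_features
print("number of nodes:", len(node_features.uids))
print("number of kAlternative nodes:", len(hlo_graph.alternative_indices))

# Global graph features: CSR-style adjacency. The successors of node i are
# out_edge_indices[out_edge_offsets[i]:out_edge_offsets[i + 1]].
out_edge_offsets = np.asarray(hlo_graph.out_edge_offsets)
out_edge_indices = np.asarray(hlo_graph.out_edge_indices)
i = 0
print("successors of node 0:", out_edge_indices[out_edge_offsets[i]:out_edge_offsets[i + 1]])
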
The full-size code can be found here. In our second example, we will show you how to use these features to create a simple decision-making agent and run XLA optimizations using it.

Defining a Custom Pipeline

All of the passes originally defined in the full XLA GPU optimization pipeline are included in HloEnv. A list of all of these passes can be found in List of Enabled XLA Hlo Passes.

A pass can be created by wrapping an HloPass object within a Pass object. This Pass can then be run within the HloEnv to modify the HloModule object loaded in the HloEnv.

from hloenv import AltPipeline, HloEnv, HloPass, Pass, Pipeline

hlo_env = HloEnv("path/to/hlo.txt", "gpu")
fusion_pass = Pass(HloPass.GpuInstructionFusion(True))
hlo_env.run(fusion_pass)

Passes can be organized into Pipeline objects, which can also be run by the HloEnv.

op_splitter_pass = Pass(HloPass.VariadicOpSplitter())

fusion_pipeline = Pipeline("fusion-pre")
fusion_pipeline.add_pass(op_splitter_pass)

# You can also add an HloPass directly without first wrapping it in a Pass
fusion_pipeline.add_pass(HloPass.GpuInstructionFusion(True))

hlo_env.run(fusion_pipeline)

A Pipeline can contain other Pipelines and Passes recursively, for example:

fusion_pipeline_pre = Pipeline("fusion-pre")
fusion_pipeline_pre.add_pass(HloPass.VariadicOpSplitter())

fusion_pipeline_full = Pipeline("fusion")
fusion_pipeline_full.add_pass(fusion_pipeline_pre)
fusion_pipeline_full.add_pass(HloPass.GpuInstructionFusion(True))

To convert a pass into a dry pass, we wrap it in an AltPipeline. All rewrites performed by an AltPipeline are captured and converted into alternatives. If you wrap multiple passes or pipelines in an AltPipeline, all modifications performed by each of those passes/pipelines will be in dry mode.

# Wrapping a single pass in an AltPipeline
fusion_dry_pass = AltPipeline(
  Pass(
    HloPass.GpuInstructionFusion(True),  # may_duplicate
  )
)

fusion_pipeline_post = Pipeline("fusion-post")
fusion_pipeline_post.add_pass(HloPass.FusionMerger())
fusion_pipeline_post.add_pass(HloPass.GpuMultiOutputFusion())
fusion_pipeline_post.add_pass(HloPass.HloCSE(True, True))

# Wrapping multiple passes in an AltPipeline
fusion_pipeline_post_dry = AltPipeline(fusion_pipeline_post)

# Adding a pass directly to an AltPipeline
fusion_pipeline_post_dry.add_pass(HloPass.HloDCE())
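
Running a dry pipeline does not commit any rewrites directly; the captured rewrites instead show up as kAlternative nodes that can later be selected with apply_alternatives (see the next example). A short usage sketch:

hlo_env.run(fusion_pipeline_post_dry)

# The captured rewrites appear as kAlternative nodes in the graph features.
hlo_graph = hlo_env.get_hlo_graph()
print("number of captured alternatives:", len(hlo_graph.alternative_indices))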

A sample General Fusion Pipeline can be found in examples/general_fusion_pipeline.py. It contains the full XLA optimization pipeline, except that we replace the vertical fusion pipeline with our custom General Fusion pass.

A Simple Decision-making Agent

Here we present a very simple decision-making agent that randomly chooses from the available actions in an optimization loop. The loop isolates the graph rewrites of an XLA pass and lays out the decisions to choose from. At a high level, the optimization loop follows these steps:

  • run pre_pass_optimizations
  • enter optimization loop
    • run pre_dry_pass_passes
    • open pass_dry_run
    • choose an action
    • apply the action
    • run post_dry_pass_passes
  • run post_pass_optimizations

We can regard pre_pass_optimizations as the pre-processing stage and post_pass_optimizations as the post-processing stage; they are therefore not included in the optimization loop.

Every step of pass_dry_run exposes the alternatives (i.e. the action space) to the user. Note that it is also surrounded by pre_dry_pass_passes and post_dry_pass_passes for pre- and post-processing; these are included in the optimization loop.

Here we are interested in the General Fusion pipeline. All of the steps described above are implemented and scheduled in the GeneralFusionPipeline class, a sample pipeline that we provide in examples/general_fusion_pipeline.py.
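
Internally, such a pipeline class simply groups the stage pipelines used by the optimization loop as attributes. Below is a minimal, hypothetical sketch; the attribute names mirror those used later in this example, while the actual examples/general_fusion_pipeline.py contains many more passes and uses General Fusion rather than the stock instruction fusion pass.

from hloenv import AltPipeline, HloPass, Pass, Pipeline

class SimpleFusionPipeline:
  """A hypothetical, stripped-down stand-in for GeneralFusionPipeline."""

  def __init__(self, hlo_env):
    # hlo_env is accepted only for parity with the provided class.
    # Passes run once before entering the optimization loop.
    self.pre_pass_optimizations = Pipeline("pre-fusion")
    self.pre_pass_optimizations.add_pass(HloPass.HloCSE(True, True))

    # Passes run at the start of every loop iteration.
    self.pre_dry_pass_passes = Pipeline("pre-dry")
    self.pre_dry_pass_passes.add_pass(HloPass.VariadicOpSplitter())

    # The dry run captures fusion rewrites as kAlternative nodes.
    self.pass_dry_run = AltPipeline(
      Pass(HloPass.GpuInstructionFusion(True))
    )

    # Passes run after alternatives have been applied in each iteration.
    self.post_dry_pass_passes = Pipeline("post-dry")
    self.post_dry_pass_passes.add_pass(HloPass.HloDCE())

    # Passes run once after the loop terminates.
    self.post_pass_optimizations = Pipeline("post-fusion")
    self.post_pass_optimizations.add_pass(HloPass.FusionMerger())

In the rest of this example we use the provided implementation: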

import numpy as np
import tensorflow as tf

from general_fusion_pipeline import GeneralFusionPipeline
from hloenv import AltPipeline, HloEnv, HloPass, Pass, Pipeline

hlo_env = HloEnv(hlo_path, "gpu")
general_fusion_pipeline = GeneralFusionPipeline(hlo_env)

The code of the optimization loop looks like this:

hlo_env.run(general_fusion_pipeline.pre_pass_optimizations)

num_alts = 1
while num_alts > 0:
  hlo_env.run(general_fusion_pipeline.pre_dry_pass_passes)
  # Open up the action space
  hlo_env.run(general_fusion_pipeline.pass_dry_run)

  # Get features from hlo_env
  hlo_graph = hlo_env.get_hlo_graph(do_hash_verification=False)
  num_alts = len(hlo_graph.alternative_indices)

  if num_alts > 0:
    # Obtain a probability distribution over the action space
    probability = uniform_policy(hlo_graph)
    # Sample an action
    decisions = argmax_sample(probability, hlo_graph)
    decisions = np.asarray(decisions)
    # Apply action to the hlo_env
    hlo_env.apply_alternatives(decisions)
    hlo_env.run(general_fusion_pipeline.post_dry_pass_passes)

hlo_env.run(general_fusion_pipeline.post_pass_optimizations)

The hlo_graph is the entry point to all available features. num_alts is the number of alternatives (i.e. actions) available in the current state. When num_alts is 0, there are no more actions to choose, and the optimization loop terminates.

Next, we detail how we implement the uniform_policy and argmax_sample functions.

The goal of uniform_policy is to output a probability distribution at each kAlternative node over all its operands (i.e. its predecessors in the HLO graph). The probability distribution is a tf.RaggedTensor, where the outer dimension is the number of kAlternative nodes and the inner dimension is the number of operands of each kAlternative node.

def uniform_policy(hlo_graph) -> tf.RaggedTensor:
  """Produce a uniform policy for the given hlo graph.

  Args:
    hlo_graph: the hlo graph

  Returns:
    a tf.RaggedTensor with shape [num_alt_idx, None]. The outer dimension
    is the alternative index, and the inner dimension is the operand index.
    Each row is a list of probability to operand indices for the
    corresponding alternative.
  """
  # get graph structures
  operands, users = get_ragged_tensor_from_hlo(hlo_graph)
  # get the indices of kAlternative nodes
  alternative_idx = tf.convert_to_tensor(hlo_graph.alternative_indices)
  # get the indices of operands for each kAlternative node
  alt_oprnd_idx: tf.RaggedTensor = tf.gather(operands, alternative_idx)

  # assign random score to each operand
  alt_oprnd_prob = tf.map_fn(
    lambda x: tf.random.uniform(shape=x.shape, minval=0, maxval=1),
    alt_oprnd_idx,
    fn_output_signature=tf.RaggedTensorSpec(shape=[None], dtype=tf.float32)
  )

  return alt_oprnd_prob
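
The helper get_ragged_tensor_from_hlo is defined in the full example file. Below is a hedged sketch of one possible implementation, which rebuilds per-node operand and user lists from the CSR edge arrays listed in the feature tables above; the provided implementation may differ.

import tensorflow as tf

def get_ragged_tensor_from_hlo(hlo_graph):
  # Row i of `operands` holds the node indices feeding node i (its in-edges);
  # row i of `users` holds the node indices consuming node i (its out-edges).
  operands = tf.RaggedTensor.from_row_splits(
    values=tf.convert_to_tensor(hlo_graph.in_edge_indices, dtype=tf.int64),
    row_splits=tf.convert_to_tensor(hlo_graph.in_edge_offsets, dtype=tf.int64)
  )
  users = tf.RaggedTensor.from_row_splits(
    values=tf.convert_to_tensor(hlo_graph.out_edge_indices, dtype=tf.int64),
    row_splits=tf.convert_to_tensor(hlo_graph.out_edge_offsets, dtype=tf.int64)
  )
  return operands, users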

The action space is defined as a 2d-array of dimension [num_alt_idx, 2]. The first column is the index of the kAlternative node, and the second column is the index of the operand to choose.

To output an action, we implement argmax_sample to choose the operand with the highest score for each kAlternative node.

def argmax_sample(probability: tf.RaggedTensor, hlo_graph) -> tf.Tensor:
  """Select the operand with the highest score for each alternative.

  Args:
    probability: a tf.RaggedTensor with shape [num_alt_idx, None].
      The outer dimension is the alternative index, and the inner
      dimension is the operand index.

    hlo_graph: the hlo graph

  Returns:
    a tf.Tensor with shape [num_alt_idx, 2], the 1st column is
    the uids of alt_idx, the 2nd column is the operand_idx to be selected.
  """
  alt_uids = hlo_graph.node_features.uids[hlo_graph.alternative_indices]

  alt_uids = tf.convert_to_tensor(alt_uids, dtype=tf.int64)

  alt_choice = tf.map_fn(
    lambda x: tf.argmax(x, axis=0),
    probability,
    fn_output_signature=tf.TensorSpec(shape=[], dtype=tf.int64)
  )

  return tf.stack([alt_uids, alt_choice], axis=1)

The full-size code can be found here.

Other Features

  • Saving and Loading HLO module

At any stage of the optimization pipeline, we can export the current HLO text to a string object for inspection.

init_hlo_str = hlo_env.export_hlo_to_str()

We can also save a snapshot of the HloEnv object at any stage and restore it at a later stage.

saved_hlo_module = hlo_env.save_hlo()
hlo_env.pre_fusion_optimizations()
post_fusion_hlo_str = hlo_env.export_hlo_to_str()
hlo_env.load_hlo(saved_hlo_module)

This can be useful when you want to explore different optimization actions from the same initial state.
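
For example, below is a minimal sketch of branching the search from a saved state; decisions_a and decisions_b stand in for two hypothetical decision arrays produced as in the optimization loop above.

saved_hlo_module = hlo_env.save_hlo()

# Branch A: apply one set of decisions and inspect the result.
hlo_env.apply_alternatives(decisions_a)
hlo_str_a = hlo_env.export_hlo_to_str()

# Restore the saved state and try a different set of decisions.
hlo_env.load_hlo(saved_hlo_module)
hlo_env.apply_alternatives(decisions_b)
hlo_str_b = hlo_env.export_hlo_to_str()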

  • DAG Hash

The existing hash implementation in XLA is lacking in two ways that increase the number of hash collisions: 1) it simply hashes the instructions in the HLO graph in post-order, and does not recursively consider the structure and connections of each HLO instruction and computation in the HLO graph; 2) instruction-specific parameters (e.g. the size and stride of an HLO Convolution instruction) are not considered in the hash of each instruction either. Our custom HloDAGHash function builds upon XLA's hash implementation, but is designed to be a more powerful hash that additionally accounts for graph topology and the parameters unique to each instruction. This reduces the chance of a hash collision when determining whether a graph has been seen before or is identical to another graph.

hlo_hash = hlo_env.get_hlo_module_hash()

This is useful for de-duplicating the dataset or uniquely labeling the state when performing a search over the state space.
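
A minimal sketch of using the hash for de-duplication during a search:

seen_hashes = set()

hlo_hash = hlo_env.get_hlo_module_hash()
if hlo_hash not in seen_hashes:
  seen_hashes.add(hlo_hash)
  # ... this state has not been visited before; expand it ...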

  • Profiling an HLO Graph

To profile the runtime of an HLO graph, we need both an executable and parameters. We obtain the executable by calling the standard compiler provided by XLA, setting run_backend_only to prevent the re-invocation of HLO passes. For parameters, we randomly generate values from N(0, 1) for floating-point parameters and fill constant values for other types. A fixed random seed is used to keep the parameters consistent across the optimization process so that we can verify the correctness of optimizations. The only parameter for evaluate() is the number of repeated evaluations.

num_eval_iterations = 100
eval_result = hlo_env.evaluate(num_eval_iterations)

The above code runs the evaluation 100 times and generates several metrics and outputs, as described below:

Evaluation Results
Name Description
durations The default duration in nanoseconds. This returns the execution duration as measured within the Tensorflow evaluation code, starting from the point when the executable has been enqueued on the compute stream until the completion of the executable.
compute_durations The duration in nanoseconds of the computation, without data transfer, as measured on the device.
full_durations The full duration of the computation as measured within HloEnv.evaluate(). This captures the entire execution process including processes such as enqueueing the computation on the compute stream, and is hence more subject to timing noise.
output The output of the HloModule.
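
For example, below is a short sketch of summarizing the timing results; it assumes the fields above are exposed as attributes of the object returned by evaluate().

import numpy as np

durations_ns = np.asarray(eval_result.durations)
print("mean duration (us):   %.2f" % (durations_ns.mean() / 1e3))
print("median duration (us): %.2f" % (np.median(durations_ns) / 1e3))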