.. _examples:

Examples Using HloEnv
=====================

This documentation only covers the Python interface. We will walk through two
simple examples that make use of several HloEnv features. We first show how to
read in an HLO text file and turn it into HLO graph features, which can be used
to implement a user-defined decision-making agent. We then present a very
simple decision-making agent whose policy randomly chooses from the available
actions.

Playing with HLO Graph Features
-------------------------------

First, make sure your current working directory is correct:

.. code-block:: bash

   $ cd /path/to/hloenv/examples

The HloEnv module holds most of the functionality, so we usually import it
first:

.. code-block:: python

   import os
   import pathlib

   import numpy as np
   from hloenv import HloEnv

Pick an HLO text file that we want to take a closer look at:

.. code-block:: python

   hlo_path = os.path.join(
     pathlib.Path(__file__).parent.absolute(),
     "hlo_texts/jax-md/module_0013.jit__lambda_.7.before_optimizations.txt"
   )

Now we are ready to create a basic HloEnv object on the GPU backend. We have
not yet worked on other ML hardware, so GPU is currently the only backend
HloEnv supports.

.. code-block:: python

   hlo_env = HloEnv(hlo_path, "gpu")

HloEnv can automatically extract features from HLO text files and organize
them into another class, HloGraph:

.. code-block:: python

   hlo_graph = hlo_env.get_hlo_graph()

To make the features more array-programming friendly, all the graph features
in HloGraph are organized in `CSR `_ (compressed sparse row) form. There are
three types of features: global graph features, node features, and in/out
edge features, all of which are accessible as members of the HloGraph object.
Details are in the following tables:

.. list-table:: **Global Graph Features**
   :widths: 42 42
   :header-rows: 1

   * - Feature Name
     - Description
   * - out_edge_offsets
     - The offsets into the out edge node ID indices array
   * - out_edge_indices
     - The out edge node ID indices array
   * - in_edge_offsets
     - The offsets into the in edge node ID indices array
   * - in_edge_indices
     - The in edge node ID indices array
   * - alternative_indices
     - The indices of all the *kAlternative* nodes
   * - opcode_attr_counts
     - The number of attributes for each HLO opcode

All edge features are vectors whose length equals the number of edges in the
HLO graph. In and out edge features share the same feature set, as follows:

.. list-table:: **In and Out Edge Features**
   :widths: 42 42
   :header-rows: 1

   * - Feature Name
     - Description
   * - uids
     - Unique ID of the edge, a concatenation of the source and destination
       node uids
   * - srcs
     - Node index of the source node
   * - dsts
     - Node index of the destination node
   * - dims
     - Dimensions of the tensor that flows along this edge
   * - layout
     - Layout of the tensor that flows along this edge
   * - lehmercodes
     - The `Lehmer code `_ (a better embedding) of the tensor layout
   * - types
     - Edge type, one of: outside any fusion, inside a fusion, or crossing
       fusions
   * - dtypes
     - Data type of the tensor that flows along this edge
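To make the CSR layout concrete, here is a minimal sketch of how one might
list the successors of a node using the global graph features above. It
assumes the standard CSR convention that the out edges of node `i` occupy
the half-open slice `[out_edge_offsets[i], out_edge_offsets[i + 1])` of
`out_edge_indices`:

.. code-block:: python

   # A minimal sketch, assuming the CSR members shown in the table above.
   out_edge_offsets = np.asarray(hlo_graph.out_edge_offsets)
   out_edge_indices = np.asarray(hlo_graph.out_edge_indices)

   node_id = 0
   # Out edges of node_id occupy this half-open slice of out_edge_indices.
   start, end = out_edge_offsets[node_id], out_edge_offsets[node_id + 1]
   print(f"Node {node_id} has out edges to nodes {out_edge_indices[start:end]}")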
All node features are vectors whose length equals the number of
HloInstructions (nodes) in the HloModule (HloGraph).

.. list-table:: **Node Features**
   :widths: 42 42
   :header-rows: 1

   * - Feature Name
     - Description
   * - uids
     - Unique ID of an HloInstruction
   * - gids
     - ID of the sub-computation the HloInstruction belongs to; 0 means it is
       in the main computation
   * - normalized_num_group_inst
     - If an HloInstruction is inside a sub-computation,
       normalized_num_group_inst is the reciprocal of the total number of
       instructions in that sub-computation. This can serve as a weighting
       parameter for an instruction's impact
   * - num_users
     - Number of HloInstructions that use the result of this HloInstruction
   * - num_operands
     - Number of HloInstructions whose results this HloInstruction uses
   * - opcodes
     - HLO opcode index, as defined `here `__
   * - opcode_attrs
     - Unique attribute embeddings for each opcode
   * - num_opcode_attrs
     - List of pairs; each pair contains the number of integer attributes and
       the number of enum attributes in opcode_attrs
   * - is_alternative
     - List of booleans indicating whether each HloInstruction is a
       *kAlternative* node
   * - is_in_fusion
     - List of booleans indicating whether each HloInstruction is inside a
       fused computation
   * - in_tensor_sizes
     - The total input tensor size from all operands of this HloInstruction
   * - out_tensor_sizes
     - The output tensor size of this HloInstruction
   * - has_max_in_tensor
     - List of booleans indicating whether one of the operands has the
       maximum input tensor size
   * - has_max_out_tensor
     - List of booleans indicating whether the output tensor size is the
       maximum
   * - names
     - List of strings giving the names of the HloInstructions

The full example code can be found `here `__. In our second example, we show
how to use these features to create a simple decision-making agent and run
XLA optimizations with it.

Defining a Custom Pipeline
--------------------------

All of the passes that make up the full XLA optimization pipeline for GPU
compilation have been included in :class:`~hloenv.HloEnv`. A list of all of
these passes can be found in :ref:`xla_passes`.

A pass is created by wrapping a :class:`~hloenv.HloPass` object within a
:class:`~hloenv.Pass` object. This Pass can then be run within the
:class:`~hloenv.HloEnv` to modify the :class:`~hloenv.HloModule` object
loaded in the HloEnv.

.. code-block:: python

   hlo_env = HloEnv("path/to/hlo.txt", "gpu")
   fusion_pass = Pass(HloPass.GpuInstructionFusion(True))
   hlo_env.run(fusion_pass)

Passes can be organized into :class:`~hloenv.Pipeline` objects, which can
also be run by the HloEnv.

.. code-block:: python

   op_splitter_pass = Pass(HloPass.VariadicOpSplitter())

   fusion_pipeline = Pipeline("fusion-pre")
   fusion_pipeline.add_pass(op_splitter_pass)
   # You can also add an HloPass directly without first wrapping it in a Pass
   fusion_pipeline.add_pass(HloPass.GpuInstructionFusion(True))

   hlo_env.run(fusion_pipeline)

A Pipeline can contain other Pipelines and Passes recursively, for example:

.. code-block:: python

   fusion_pipeline_pre = Pipeline("fusion-pre")
   fusion_pipeline_pre.add_pass(HloPass.VariadicOpSplitter())

   fusion_pipeline_full = Pipeline("fusion")
   fusion_pipeline_full.add(fusion_pipeline_pre)
   fusion_pipeline_full.add(HloPass.GpuInstructionFusion(True))
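Once composed, a nested Pipeline is run just like a single Pass. Continuing
the example above:

.. code-block:: python

   # Run the composed pipeline on the HloModule loaded in the environment.
   hlo_env.run(fusion_pipeline_full)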
To convert a pass into a `dry pass`, we wrap it in an
:class:`~hloenv.AltPipeline`. All rewrites performed by an AltPipeline are
captured and converted into alternatives. If you wrap multiple passes or
pipelines in an AltPipeline, `all` modifications performed by each of those
passes/pipelines will be in dry mode.

.. code-block:: python

   # Wrapping a single pass in an AltPipeline
   fusion_dry_pass = AltPipeline(
     Pass(
       HloPass.GpuInstructionFusion(True),  # may_duplicate
     )
   )

   fusion_pipeline_post = Pipeline("fusion-post")
   fusion_pipeline_post.add_pass(HloPass.FusionMerger())
   fusion_pipeline_post.add_pass(HloPass.GpuMultiOutputFusion())
   fusion_pipeline_post.add_pass(HloPass.HloCSE(True, True))

   # Wrapping multiple passes in an AltPipeline
   fusion_pipeline_post_dry = AltPipeline(fusion_pipeline_post)

   # Adding a pass directly to an AltPipeline
   fusion_pipeline_post_dry.add_pass(HloPass.HloDCE())

A sample General Fusion Pipeline can be found in
`examples/general_fusion_pipeline.py`. It contains the full XLA optimization
pipeline, except that we replace the vertical fusion pipeline with our custom
General Fusion pass.

A Simple Decision-making Agent
------------------------------

Here we present a very simple decision-making agent that randomly chooses
from the available actions in an optimization loop. The loop isolates the
graph rewrites of an XLA pass and lays out the decisions to choose from. At a
high level, the optimization loop follows these steps:

* run `pre_pass_optimizations`
* enter the optimization loop:

  * run `pre_dry_pass_passes`
  * open the action space with `pass_dry_run`
  * choose an action
  * apply the action
  * run `post_dry_pass_passes`

* run `post_pass_optimizations`

We can regard `pre_pass_optimizations` as the pre-processing stage and
`post_pass_optimizations` as the post-processing stage, so they are not
included in the optimization loop. Every `pass_dry_run` step exposes the
alternatives (i.e., the action space) to the user. Note that it is also
surrounded by `pre_dry_pass_passes` and `post_dry_pass_passes` for some
per-iteration pre/post-processing; these are included in the optimization
loop.

Here we are interested in the `GeneralFusion` pipeline. All of the steps
described above are implemented and scheduled in the `GeneralFusionPipeline`
class, a sample pipeline that we provide in
`examples/general_fusion_pipeline.py`:

.. code-block:: python

   from general_fusion_pipeline import GeneralFusionPipeline
   from hloenv import AltPipeline, HloEnv, HloPass, Pass, Pipeline

   hlo_env = HloEnv(hlo_path, "gpu")
   general_fusion_pipeline = GeneralFusionPipeline(hlo_env)

The code of the optimization loop looks like this:

.. code-block:: python

   hlo_env.run(general_fusion_pipeline.pre_pass_optimizations)

   num_alts = 1
   while num_alts > 0:
     hlo_env.run(general_fusion_pipeline.pre_dry_pass_passes)

     # Open up the action space
     hlo_env.run(general_fusion_pipeline.pass_dry_run)

     # Get features from hlo_env
     hlo_graph = hlo_env.get_hlo_graph(do_hash_verification=False)
     num_alts = len(hlo_graph.alternative_indices)

     if num_alts > 0:
       # Obtain a probability distribution over the action space
       probability = uniform_policy(hlo_graph)
       # Sample an action
       decisions = argmax_sample(probability, hlo_graph)
       decisions = np.asarray(decisions)
       # Apply the action to the hlo_env
       hlo_env.apply_alternatives(decisions)

     hlo_env.run(general_fusion_pipeline.post_dry_pass_passes)

   hlo_env.run(general_fusion_pipeline.post_pass_optimizations)

The `hlo_graph` is the entry point to all available features. `num_alts` is
the number of alternatives (i.e., actions) available in the current state.
When `num_alts` is 0, there are no more actions to choose from, and the
optimization loop terminates.
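Once the loop terminates, a simple way to sanity-check the result is to
export the optimized module back to HLO text (the export and snapshot
utilities are described in the Other Features section below):

.. code-block:: python

   # Dump the optimized module as HLO text for inspection.
   optimized_hlo_str = hlo_env.export_hlo_to_str()
   print(optimized_hlo_str)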
Next, we detail how we implement the `uniform_policy` and `argmax_sample`
functions.

The goal of `uniform_policy` is to output a probability distribution at each
kAlternative node over all its operands (i.e., its predecessors in the HLO
graph). The probability distribution is a tf.RaggedTensor, where the outer
dimension is the number of kAlternative nodes and the inner dimension is the
number of operands of each kAlternative node.

.. code-block:: python

   import tensorflow as tf

   def uniform_policy(hlo_graph) -> tf.RaggedTensor:
     """Produce a uniform policy for the given hlo graph.

     Args:
       hlo_graph: the hlo graph

     Returns:
       a tf.RaggedTensor with shape [num_alt_idx, None]. The outer dimension
       is the alternative index, and the inner dimension is the operand
       index. Each row is a list of probabilities over the operand indices
       of the corresponding alternative.
     """
     # Get the graph structure; get_ragged_tensor_from_hlo is a helper
     # defined in the full example code linked below.
     operands, users = get_ragged_tensor_from_hlo(hlo_graph)
     # Get the indices of the kAlternative nodes
     alternative_idx = tf.convert_to_tensor(hlo_graph.alternative_indices)
     # Get the indices of the operands of each kAlternative node
     alt_oprnd_idx: tf.RaggedTensor = tf.gather(operands, alternative_idx)
     # Assign a random score to each operand
     alt_oprnd_prob = tf.map_fn(
       lambda x: tf.random.uniform(shape=x.shape, minval=0, maxval=1),
       alt_oprnd_idx,
       fn_output_signature=tf.RaggedTensorSpec(shape=[None], dtype=tf.float32)
     )
     return alt_oprnd_prob

The action space is defined as a 2D array of shape [num_alt_idx, 2]. The
first column is the uid of the kAlternative node, and the second column is
the index of the operand to choose. To output an action, we implement
`argmax_sample` to choose the operand with the highest score for each
kAlternative node.

.. code-block:: python

   def argmax_sample(probability: tf.RaggedTensor, hlo_graph) -> tf.Tensor:
     """Select the operand with the highest score for each alternative.

     Args:
       probability: a tf.RaggedTensor with shape [num_alt_idx, None]. The
         outer dimension is the alternative index, and the inner dimension
         is the operand index.
       hlo_graph: the hlo graph

     Returns:
       a tf.Tensor with shape [num_alt_idx, 2]; the 1st column is the uid of
       each alternative, and the 2nd column is the operand_idx to be
       selected.
     """
     alt_uids = hlo_graph.node_features.uids[hlo_graph.alternative_indices]
     alt_uids = tf.convert_to_tensor(alt_uids, dtype=tf.int64)
     alt_choice = tf.map_fn(
       lambda x: tf.argmax(x, axis=0),
       probability,
       fn_output_signature=tf.TensorSpec(shape=[], dtype=tf.int64)
     )
     return tf.stack([alt_uids, alt_choice], axis=1)

The full example code can be found `here `__.

Other Features
--------------

- Saving and Loading HLO module

At any stage of the optimization pipeline, we can export the current HLO
text to a string object for inspection.

.. code-block:: python

   init_hlo_str = hlo_env.export_hlo_to_str()

We can also save a snapshot of an HloEnv object at any stage and restore it
at a later stage.

.. code-block:: python

   saved_hlo_module = hlo_env.save_hlo()
   hlo_env.pre_fusion_optimizations()
   post_fusion_hlo_str = hlo_env.export_hlo_to_str()
   hlo_env.load_hlo(saved_hlo_module)

This can be useful when you want to explore different optimization actions
from the same initial state.
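For example, a minimal sketch of exploring two branches from the same saved
state (here `pipeline_a` and `pipeline_b` are placeholders for any two
pipelines or passes you want to compare):

.. code-block:: python

   saved_hlo_module = hlo_env.save_hlo()

   hlo_env.run(pipeline_a)                # explore the first branch
   branch_a_str = hlo_env.export_hlo_to_str()

   hlo_env.load_hlo(saved_hlo_module)     # rewind to the saved state
   hlo_env.run(pipeline_b)                # explore the second branch
   branch_b_str = hlo_env.export_hlo_to_str()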
- DAG Hash

The existing hash implementation in XLA is lacking in two ways that increase
the number of hash collisions: 1) it simply hashes the instructions in the
HLO graph in post-order, and does not recursively consider the structure and
connections of each HLO instruction and computation in the HLO graph; 2)
instruction-specific parameters (e.g., the size and stride of an HLO
Convolution instruction) are not considered in the hash of each instruction.

Our custom HloDAGHash function builds upon XLA's hash implementation, but is
designed to be a more powerful hash that additionally accounts for graph
topology and the parameters unique to each instruction. This reduces the
chance of a hash collision when determining whether a graph has been seen
before or is identical to another graph.

.. code-block:: python

   hlo_hash = hlo_env.get_hlo_module_hash()

This is useful for de-duplicating the dataset or uniquely labeling the state
when performing a search over the state space.

- Profiling an HLO Graph

To profile the runtime of an HLO graph, we need to obtain both the
executable and the parameters. We obtain the executable by calling the
standard compiler provided by XLA while setting *run_backend_only* to
prevent the reinvocation of HLO passes. For the parameters, we randomly
generate N(0, 1) values for floating-point parameters and fill constant
values for other types. A fixed random seed is used to keep the parameters
consistent across the optimization process so that we can verify the
correctness of optimizations. The only parameter to evaluate() is the number
of times the evaluation is repeated.

.. code-block:: python

   num_eval_iterations = 100
   eval_result = hlo_env.evaluate(num_eval_iterations)

The above code runs the evaluation 100 times and generates several metrics,
described below:

.. list-table:: **Evaluation Results**
   :widths: 42 42
   :header-rows: 1

   * - Name
     - Description
   * - durations
     - The default duration in nanoseconds. This returns the execution
       duration as measured within the TensorFlow evaluation code, starting
       from the point when the executable has been enqueued on the compute
       stream until the completion of the executable.
   * - compute_durations
     - The duration in nanoseconds of the computation, without data
       transfer, as measured on the device.
   * - full_durations
     - The full duration of the computation as measured within
       HloEnv.evaluate(). This captures the entire execution process,
       including steps such as enqueueing the computation on the compute
       stream, and is hence more subject to timing noise.
   * - output
     - The output of the HloModule.
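As a final sketch (assuming the result object exposes the fields listed
above as attributes), the per-iteration timings can be summarized with
NumPy:

.. code-block:: python

   # Durations are reported in nanoseconds; convert to milliseconds.
   durations = np.asarray(eval_result.full_durations)
   print(f"best: {durations.min() / 1e6:.3f} ms, "
         f"mean: {durations.mean() / 1e6:.3f} ms over {len(durations)} runs")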