Next: 4 Concept of State
Up: Appnotes Index
Previous: 2 Introduction to Graphical Data Flow Description
In the introductory discussions of DF graphs, we have only touched on the process of data distribution. When the data resides on the same processor that performs the computation, the data is not moved; instead, a pointer to the data is passed between the consecutive processing nodes. When the processing nodes are separated on multiple processors, the data is actually copied from one processor's memory space to the other. This copying takes a finite time to execute. The designer must be aware of this time, but in the early stages of the design the data movement time is usually assumed to be negligible or is accommodated by adding additional processing time to each node's computation time. As the design progresses, the actual transfer time is included in the simulation to verify that the system has met its latency specification. In the next section we introduce the set of specialized nodes that are used to copy and replicate the data so that the DF graph can express the parallelism in the algorithm.
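The following sketch (all values are assumed for illustration, not taken from the appnote) shows the kind of quick check a designer makes when deciding whether the data movement time is negligible: estimate a transfer time from the block size, link bandwidth, and link latency, and compare it to the node's computation time.

    block_bytes = 64 * 1024          # size of the block moved between processors (assumed)
    link_bandwidth = 100.0e6         # link bandwidth in bytes per second (assumed)
    link_latency = 20.0e-6           # per-transfer startup latency in seconds (assumed)
    compute_time = 5.0e-3            # node computation time for the block in seconds (assumed)

    transfer_time = link_latency + block_bytes / link_bandwidth
    print("transfer time:", transfer_time, "seconds")                  # roughly 0.00068 s
    print("fraction of compute time:", transfer_time / compute_time)   # roughly 0.14
    # Early in the design this cost is ignored or folded into the node's computation
    # time; later the measured transfer times are included in the simulation.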
The block size that allows a single node to keep up with the input stream can be found by equating the block arrival time to the block processing time, as shown in Figure 3-3. If there is no solution for any block size, or it is impractical to use the solution block size, then the processing must be parallelized as shown in Figure 3-4. The parallelization procedure begins by inserting a Splitter node, which sends the first block of data to node 1, the second block to node 2, and so on. If the nodes are distributed to separate processors, then the time line shown in Figure 3-5 results. For this method of parallelization, the timeline shows a latency that is greater than one block processing time. In this case the latency can be decreased by using a smaller block size and more processors, as shown in Figure 3-6. This figure shows that the latency is cut by a factor of two by increasing the number of processors by a factor of two. This is possible when the processing time is a linear function of the block size.
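As an illustration of this balance condition, the sketch below (with assumed timing values, using the linear node-time model tnode = a + b N discussed later in this section) solves for the smallest block size at which the block processing time no longer exceeds the block arrival time, and reports that parallelization is required when no block size works.

    import math

    def min_block_size(a, b, Ts):
        # Smallest N such that the block processing time a + b*N does not
        # exceed the block arrival time N*Ts (Ts = sample arrival interval).
        if b >= Ts:
            return None                      # per-sample cost exceeds arrival rate: no block size works
        return math.ceil(a / (Ts - b))       # a + b*N <= N*Ts  =>  N >= a / (Ts - b)

    Ts = 1.0e-6                 # one sample arrives every microsecond (assumed)
    a, b = 1.0e-3, 0.8e-6       # assumed setup time and per-sample processing time
    N = min_block_size(a, b, Ts)
    if N is None:
        print("no block size keeps up -- the processing must be parallelized")
    else:
        print("minimum block size:", N, "samples")                 # about 5000 samples
        print("one block processing time:", a + b * N, "seconds")  # about 0.005 s
    # With a linear processing time, splitting the stream over P processors with a
    # block size of N/P cuts the latency by roughly a factor of P.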
The other node shown in Figure 3-4 collects the data stream back together and is called the merge node. This node fires when a block of data arrives on one of its indicated input queues. Strictly speaking, the merge node does not behave as an ordinary DF node. The rule for firing a DF node is that the node fires, or computes an output, when all of its input queues are over threshold. In the case of a merge node, only one of the input queues will be over threshold at a given time. In practice, the merge node uses an additional queue of control data, generated by the split node, that indicates which of the input queues the merge node should examine for the queue-over-threshold signal. This additional queue is shown in the graph of Figure 3-4. This control queue assures us that the merge node will reassemble the data blocks in the same order in which they were split apart by the splitter node.
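The following sketch (an illustration of the behavior described above, not the actual node implementation) shows how a split node can record on a control queue which branch received each block, and how a merge node can consult that control queue to decide which input queue to read next, so that the blocks are reassembled in their original order.

    from collections import deque

    def split(blocks, branch_queues, control_queue):
        # Round-robin distribution; the control queue records which branch got each block.
        for i, blk in enumerate(blocks):
            branch = i % len(branch_queues)
            branch_queues[branch].append(blk)
            control_queue.append(branch)

    def merge(branch_queues, control_queue, output_queue):
        # Fire only when the branch named at the head of the control queue
        # has a block available (its queue is "over threshold").
        while control_queue and branch_queues[control_queue[0]]:
            branch = control_queue.popleft()
            output_queue.append(branch_queues[branch].popleft())

    branches = [deque(), deque()]
    control, out = deque(), deque()
    split(["b0", "b1", "b2", "b3"], branches, control)
    merge(branches, control, out)
    print(list(out))   # ['b0', 'b1', 'b2', 'b3'] -- original order preserved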
In the SAR example (see the SAR Case Study) the split/merge pair was used to distribute the range processing data to each node. The graph also made use of the Family notation, so the discussion of this graph is deferred to Section 3.4.
The range processing is executed between the split and merge nodes, and Family notation is used for the range processing nodes. The external feed-forward queue that feeds the control data from the split node (dfc_swth) to the merge node is also shown. The dfc_rep node was added to this path because of the configuration of the dfc_swth node used to implement the control feed-forward. Since the azimuth processing could not proceed until a full frame of range data had been processed, the azimuth processing uses a Sep and Cat pair. This pair was used to implement the corner turning required by the SAR processing. The processed range data could only be distributed after the queue following the merge node contained a complete frame's worth of data. A Transpose node could have been used to perform the corner turning, but in order to perform the azimuth processing in parallel, a Sep node would still have to follow the transpose node. Therefore the transpose operation was performed by the Sep node in addition to the parallelization of the data stream. Again, the azimuth processing nodes are represented using Family notation.
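The corner turn itself is a transpose of the frame followed by a separation into per-family column groups. The sketch below (using numpy and assumed frame dimensions, not the actual Sep node) illustrates the data rearrangement that the Sep node performs together with the parallelization of the azimuth stream.

    import numpy as np

    num_range, num_azimuth, num_procs = 512, 256, 4     # assumed frame dimensions
    frame = np.arange(num_range * num_azimuth).reshape(num_range, num_azimuth)

    # Corner turn: transpose so that each row is now one azimuth line.
    turned = frame.T                                    # shape (num_azimuth, num_range)

    # Separate the turned frame into contiguous groups, one per azimuth family member.
    azimuth_groups = np.array_split(turned, num_procs, axis=0)
    print([g.shape for g in azimuth_groups])            # 4 groups of shape (64, 512)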
Because the setup time for a single block of range processing was large, the input blocks of data were grouped together and the range computation node was run on a group of blocks at one time. The range processing node included a loop over the blocks, so the inner processing was still performed on a range-block-by-range-block basis. Since all the computations were performed on one processor, the successive range computations could be pipelined so that the setup time was only present for the first range block, and the overall processing time was reduced. The additional latency caused by the range computation being performed over several blocks at a time was acceptable. This latency is small compared to the latency inherent in the corner turning operation of the SAR computation: a full frame (range by azimuth block data size) of latency exists as a minimum, in addition to the azimuth computation time.
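A quick calculation (with assumed timings, not the measured SAR values) shows why grouping pays off: firing the node once per block pays the setup time for every block, while the pipelined group pays it essentially once, at the price of the extra latency noted above.

    setup = 2.0e-3        # per-firing setup time in seconds (assumed)
    per_block = 0.5e-3    # processing time for one range block in seconds (assumed)
    G = 16                # number of blocks grouped per node firing (assumed)

    ungrouped = G * (setup + per_block)   # one firing per block: setup paid G times
    grouped   = setup + G * per_block     # one pipelined firing per group: setup paid once
    print(ungrouped, grouped)             # 0.040 s versus 0.010 s
    # The cost is added latency: the first block's result is not available
    # until the whole group has been processed.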
This technique is illustrated by the ETC graph (see the ETC4ALFS on COTS Processor Case Study for additional information regarding this application). Figures 3-10 and 3-11 show how the DF graph was set up and tested for the complete ETC algorithm. The input data arrives as a block of range data that has been processed by the matched filter, and there are [1..Num_beams] input streams. The per-beam processing, represented by the family of rg_w processing nodes, is shown in the subgraph of Figure 3-11. In all cases shown in the figure, the identity processing nodes are represented by Meter nodes.
The family output data streams were then combined as shown using the CAT node. The processing was then completed by a set of sequential nodes. This DF graph design was tested with sequential data streams, and the integrity of the DF design was established. By using a performance simulation tool and assigning a representative processing time to each processing node (Meter node), an execution time line was developed. This is shown in Figure 3-12.
The processing was divided onto 8 processors. In this initial design the node processing times were approximated by a simple linear formula as a function of the block size N. The processing time for a node is tnode = a + b N, where a and b are processing-time constants and N is the number of data items processed per node firing.
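The sketch below (with assumed coefficients and block sizes, not the case-study values) shows how such a linear node-time model can be used to estimate per-processor busy time when blocks are dealt out round-robin across the 8 processors, which is essentially what the performance simulation's time line summarizes.

    def node_time(a, b, n_items):
        # Linear processing-time model: setup cost a plus per-item cost b.
        return a + b * n_items

    num_procs = 8
    blocks = [4096] * 32               # 32 blocks of 4096 items per firing (assumed)
    a, b = 1.0e-4, 2.0e-8              # assumed coefficients, in seconds

    # Busy time accumulated on each processor when blocks are dealt out round-robin.
    busy = [0.0] * num_procs
    for i, n in enumerate(blocks):
        busy[i % num_procs] += node_time(a, b, n)
    print(["%.6f" % t for t in busy])  # per-processor busy time in seconds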