Details of the Example Transformation Design

Note that the graph has only one phase and contains four cluster components (highlighted by a red border). These components mark the points where the "node allocation" changes, so the part of the graph they demarcate is highlighted by the red rectangle. This part of the graph processes data in parallel: the components inside the rectangle run on cluster nodes according to the "node allocation" of that part of the graph.

The rest of the graph runs on a single node, called the "primary worker".

Specification of "node allocation"

Since there is only one phase, the whole graph has just one primary worker and only one node allocation.

  • the node allocation applies to the group of components running in parallel (demarcated by the four cluster components)

  • the outer part of the graph runs on a single node, the primary worker.

The primary worker is specified by the sandbox code used in the URLs of input data. The following dialog shows the File URL value "sandbox://data/path-to-csv-file", where "data" is the ID of the server sandbox containing the specified file. It is this local "data" sandbox that defines the primary worker of the graph.

The part of the graph demarcated by the four cluster components could also have its allocation specified by a file URL attribute, but this part does not work with files at all, so there is no file URL. Instead, we use the "allocation" attribute. Since all components in this part must have the same allocation, it is sufficient to set it on just one component.
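As an illustration of how the sandbox ID sits inside such a URL, here is a minimal Python sketch. It is not part of the product; the helper name sandbox_id is hypothetical, and it only assumes that the sandbox ID occupies the host position of a "sandbox://" URL, as in the example above.

```python
from urllib.parse import urlsplit


def sandbox_id(file_url: str) -> str:
    """Return the sandbox ID from a 'sandbox://<id>/<path>' URL.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(file_url)
    if parts.scheme != "sandbox":
        raise ValueError(f"not a sandbox URL: {file_url!r}")
    # The sandbox ID is the part between '//' and the next '/'.
    return parts.netloc


if __name__ == "__main__":
    print(sandbox_id("sandbox://data/path-to-csv-file"))
```

For the File URL shown in the dialog, the helper returns "data", which is the sandbox that determines the primary worker.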

Again, "dataPartitioned" in the following dialog is the sandbox ID.

Let's look at our sandboxes. This project requires three sandboxes: "data", "dataPartitioned" and "PhoneChargesDistributed".

  • data

    • contains input and output data

    • local sandbox (yellow folder), so it has only one physical location

    • accessible only on node "i-4cc9733b" in the specified path

  • dataPartitioned

    • partitioned sandbox (red folder), so it has a list of physical locations on different nodes

    • does not contain any data; since the graph neither reads from nor writes to this sandbox, it is used only to define the "node allocation"

    • in the following figure, the allocation is configured for two cluster nodes

  • PhoneChargesDistributed

    • common sandbox containing the graph file, metadata, and connections

    • shared sandbox (blue folder), so all cluster nodes have access to the same files

If the graph were executed with the sandbox configuration shown in the previous figure, the node allocation would be:

  • components that run only on the primary worker will run only on the "i-4cc9733b" node, according to the location of the "data" sandbox.

  • components with allocation according to the "dataPartitioned" sandbox will run on nodes "i-4cc9733b" and "i-52d05425".
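The resolution described above can be sketched as a small lookup. This is only an illustrative model, not the server's actual implementation: the Sandbox class and both functions are hypothetical names, and the sketch simplifies by treating the primary worker's sandbox as having exactly one location. Only the two sandboxes that the text ties to allocation are modelled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Sandbox:
    """Hypothetical model of a sandbox and the nodes holding its locations."""
    sandbox_id: str
    kind: str               # "local" or "partitioned" in this sketch
    nodes: Tuple[str, ...]  # cluster node IDs with a physical location


# Sandbox configuration taken from the example in the text.
SANDBOXES = {
    "data": Sandbox("data", "local", ("i-4cc9733b",)),
    "dataPartitioned": Sandbox(
        "dataPartitioned", "partitioned", ("i-4cc9733b", "i-52d05425")
    ),
}


def primary_worker(input_sandbox_id: str) -> str:
    """Node that runs the non-parallel part of the graph."""
    sandbox = SANDBOXES[input_sandbox_id]
    # Simplifying assumption: the sandbox has a single physical location.
    if len(sandbox.nodes) != 1:
        raise ValueError("expected a sandbox with exactly one location")
    return sandbox.nodes[0]


def parallel_allocation(allocation_sandbox_id: str) -> Tuple[str, ...]:
    """Nodes that run the components between the cluster components."""
    return SANDBOXES[allocation_sandbox_id].nodes


if __name__ == "__main__":
    print(primary_worker("data"))
    print(parallel_allocation("dataPartitioned"))
```

With the configuration from the example, the first call yields the single node of the "data" sandbox and the second yields both nodes listed for "dataPartitioned".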