Class MergeJoin

  extended by org.pentaho.di.trans.step.BaseStep
      extended by org.pentaho.di.trans.steps.mergejoin.MergeJoin
All Implemented Interfaces:
HasLogChannelInterface, LoggingObjectInterface, VariableSpace, StepInterface

public class MergeJoin
extends BaseStep
implements StepInterface

Merge rows from 2 sorted streams and output joined rows with matched key fields. Use this instead of hash join is both your input streams are too big to fit in memory. Note that both the inputs must be sorted on the join key. This is a first prototype implementation that only handles two streams and inner join. It also always outputs all values from both streams. Ideally, we should: 1) Support any number of incoming streams 2) Allow user to choose the join type (inner, outer) for each stream 3) Allow user to choose which fields to push to next step 4) Have multiple output ports as follows: a) Containing matched records b) Unmatched records for each input port 5) Support incoming rows to be sorted either on ascending or descending order. The currently implementation only supports ascending


Field Summary
Fields inherited from class org.pentaho.di.trans.step.BaseStep
first, linesInput, linesOutput, linesRead, linesRejected, linesSkipped, linesUpdated, linesWritten, terminator, terminator_rows
Constructor Summary
MergeJoin(StepMeta stepMeta, StepDataInterface stepDataInterface, int copyNr, TransMeta transMeta, Trans trans)
Method Summary
 boolean init(StepMetaInterface smi, StepDataInterface sdi)
          Initialize and do work where other steps need to wait for...
 boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
          Perform the equivalent of processing one row.
Methods inherited from class org.pentaho.di.trans.step.BaseStep
addResultFile, addRowListener, addStepListener, batchComplete, buildLog, canProcessOneRow, cleanup, closeQuietly, copyVariablesFrom, decrementLinesRead, decrementLinesWritten, dispatch, dispose, environmentSubstitute, environmentSubstitute, findInputRowSet, findInputRowSet, findOutputRowSet, findOutputRowSet, getBooleanValueOfVariable, getClusterSize, getContainerObjectId, getCopy, getDispatcher, getErrorRowMeta, getErrors, getFilename, getInputRowMeta, getInputRowSets, getLinesInput, getLinesOutput, getLinesRead, getLinesRejected, getLinesSkipped, getLinesUpdated, getLinesWritten, getLogChannel, getLogChannelId, getLogFields, getLogLevel, getNextClassNr, getObjectCopy, getObjectId, getObjectName, getObjectRevision, getObjectType, getOutputRowSets, getParent, getParentVariableSpace, getPartitionID, getPartitionTargets, getPreviewRowMeta, getProcessed, getRegistrationDate, getRemoteInputSteps, getRemoteOutputSteps, getRepartitioning, getRepositoryDirectory, getResultFiles, getRow, getRowFrom, getRowListeners, getRuntime, getServerSockets, getSlaveNr, getSocketRepository, getStatus, getStatusDescription, getStepDataInterface, getStepID, getStepListeners, getStepMeta, getStepMetaInterface, getStepname, getTrans, getTransMeta, getTypeId, getUniqueStepCountAcrossSlaves, getUniqueStepNrAcrossSlaves, getVariable, getVariable, identifyErrorOutput, incrementLinesInput, incrementLinesOutput, incrementLinesRead, incrementLinesRejected, incrementLinesSkipped, incrementLinesUpdated, incrementLinesWritten, initBeforeStart, initializeVariablesFrom, injectVariables, isBasic, isDebug, isDetailed, isDistributed, isInitialising, isMapping, isPartitioned, isPaused, isRowLevel, isRunning, isStopped, isUsingThreadPriorityManagment, listVariables, logBasic, logBasic, logDebug, logDebug, logDetailed, logDetailed, logError, logError, logError, logMinimal, logMinimal, logRowlevel, logRowlevel, logSummary, markStart, markStop, outputIsDone, pauseRunning, putError, putRow, putRowTo, removeRowListener, resumeRunning, rowsetInputSize, rowsetOutputSize, safeModeChecking, setCarteObjectId, setCopy, setDistributed, setErrorRowMeta, setErrors, setInputRowMeta, setInputRowSets, setInternalVariables, setLinesInput, setLinesOutput, setLinesRead, setLinesRejected, setLinesSkipped, setLinesUpdated, setLinesWritten, setLogLevel, setOutputDone, setOutputRowSets, setParentVariableSpace, setPartitioned, setPartitionID, setPartitionTargets, setPaused, setPaused, setPreviewRowMeta, setRepartitioning, setRunning, setServerSockets, setSocketRepository, setStepDataInterface, setStepListeners, setStepMeta, setStepMetaInterface, setStepname, setStopped, setTransMeta, setUsingThreadPriorityManagment, setVariable, shareVariablesWith, stopAll, stopRunning, stopRunning, toString
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface org.pentaho.di.trans.step.StepInterface
addRowListener, addStepListener, batchComplete, canProcessOneRow, cleanup, dispose, getCopy, getErrors, getInputRowSets, getLinesInput, getLinesOutput, getLinesRead, getLinesRejected, getLinesUpdated, getLinesWritten, getLogChannel, getOutputRowSets, getPartitionID, getProcessed, getResultFiles, getRow, getRowListeners, getRuntime, getStatus, getStepID, getStepMeta, getStepname, getTrans, identifyErrorOutput, initBeforeStart, isMapping, isPartitioned, isPaused, isRunning, isStopped, isUsingThreadPriorityManagment, markStart, markStop, pauseRunning, putRow, removeRowListener, resumeRunning, rowsetInputSize, rowsetOutputSize, setErrors, setLinesRejected, setOutputDone, setPartitioned, setPartitionID, setRepartitioning, setRunning, setStopped, setUsingThreadPriorityManagment, stopAll, stopRunning
Methods inherited from interface org.pentaho.di.core.variables.VariableSpace
copyVariablesFrom, environmentSubstitute, environmentSubstitute, getBooleanValueOfVariable, getParentVariableSpace, getVariable, getVariable, initializeVariablesFrom, injectVariables, listVariables, setParentVariableSpace, setVariable, shareVariablesWith

Constructor Detail


public MergeJoin(StepMeta stepMeta,
                 StepDataInterface stepDataInterface,
                 int copyNr,
                 TransMeta transMeta,
                 Trans trans)
Method Detail


public boolean processRow(StepMetaInterface smi,
                          StepDataInterface sdi)
                   throws KettleException
Description copied from interface: StepInterface
Perform the equivalent of processing one row. Typically this means reading a row from input (getRow()) and passing a row to output (putRow)).

Specified by:
processRow in interface StepInterface
processRow in class BaseStep
smi - The steps metadata to work with
sdi - The steps temporary working data to work with (database connections, result sets, caches, temporary variables, etc.)
false if no more rows can be processed or an error occurred.


public boolean init(StepMetaInterface smi,
                    StepDataInterface sdi)
Description copied from interface: StepInterface
Initialize and do work where other steps need to wait for...

Specified by:
init in interface StepInterface
init in class BaseStep
smi - The metadata to work with
sdi - The data to initialize
See Also:
StepInterface.init( org.pentaho.di.trans.step.StepMetaInterface , org.pentaho.di.trans.step.StepDataInterface)