Class MergeJoin
java.lang.Object
org.pentaho.di.trans.step.BaseStep
org.pentaho.di.trans.steps.mergejoin.MergeJoin
- All Implemented Interfaces:
org.pentaho.di.core.ExtensionDataInterface,HasLogChannelInterface,org.pentaho.di.core.logging.LoggingObjectInterface,org.pentaho.di.core.logging.LoggingObjectLifecycleInterface,org.pentaho.di.core.variables.VariableSpace,StepInterface
Merge rows from 2 sorted streams and output joined rows with matched key fields. Use this instead of hash join is
both your input streams are too big to fit in memory. Note that both the inputs must be sorted on the join key.
This is a first prototype implementation that only handles two streams and inner join. It also always outputs all
values from both streams. Ideally, we should: 1) Support any number of incoming streams 2) Allow user to choose the
join type (inner, outer) for each stream 3) Allow user to choose which fields to push to next step 4) Have multiple
output ports as follows: a) Containing matched records b) Unmatched records for each input port 5) Support incoming
rows to be sorted either on ascending or descending order. The currently implementation only supports ascending
- Since:
- 24-nov-2006
- Author:
- Biswapesh
-
Field Summary
Fields inherited from class org.pentaho.di.trans.step.BaseStep
deadLockCounter, extensionDataMap, first, linesInput, linesOutput, linesRead, linesRejected, linesSkipped, linesUpdated, linesWritten, log, metaStore, repository, rowListeners, safeStopped, terminator, terminator_rows, variables -
Constructor Summary
ConstructorsConstructorDescriptionMergeJoin(StepMeta stepMeta, StepDataInterface stepDataInterface, int copyNr, TransMeta transMeta, Trans trans) -
Method Summary
Modifier and TypeMethodDescriptionbooleaninit(StepMetaInterface smi, StepDataInterface sdi) Initialize and do work where other steps need to wait for...protected booleanisInputLayoutValid(org.pentaho.di.core.row.RowMetaInterface row1, org.pentaho.di.core.row.RowMetaInterface row2) Checks whether incoming rows are join compatible.booleanprocessRow(StepMetaInterface smi, StepDataInterface sdi) Perform the equivalent of processing one row.Methods inherited from class org.pentaho.di.trans.step.BaseStep
addResultFile, addRowListener, addRowSetToInputRowSets, addRowSetToOutputRowSets, addStepListener, batchComplete, beforeStartProcessing, buildLog, canProcessOneRow, checkFeedback, cleanup, clearInputRowSets, clearOutputRowSets, closeQuietly, copyVariablesFrom, decrementLinesRead, decrementLinesWritten, dispatch, dispose, environmentSubstitute, environmentSubstitute, environmentSubstitute, fieldSubstitute, findInputRowSet, findInputRowSet, findOutputRowSet, findOutputRowSet, getBooleanValueOfVariable, getClusterSize, getContainerObjectId, getCopy, getCurrentInputRowSetNr, getCurrentOutputRowSetNr, getDispatcher, getErrorRowMeta, getErrors, getExtensionDataMap, getFilename, getFirstInputRowSet, getInputRowMeta, getInputRowSets, getLinesInput, getLinesOutput, getLinesRead, getLinesRejected, getLinesSkipped, getLinesUpdated, getLinesWritten, getLogChannel, getLogChannelId, getLogFields, getLogLevel, getMetaStore, getNextClassNr, getObjectCopy, getObjectId, getObjectName, getObjectRevision, getObjectType, getOutputRowSets, getParent, getParentVariableSpace, getPartitionID, getPartitionTargets, getPreviewRowMeta, getProcessed, getRegistrationDate, getRemoteInputSteps, getRemoteOutputSteps, getRepartitioning, getRepository, getRepositoryDirectory, getResultFiles, getRow, getRowFrom, getRowHandler, getRowListeners, getRuntime, getServerSockets, getSlaveNr, getSocketRepository, getStatus, getStatusDescription, getStepDataInterface, getStepID, getStepListeners, getStepMeta, getStepMetaInterface, getStepname, getTrans, getTransMeta, getTypeId, getUniqueStepCountAcrossSlaves, getUniqueStepNrAcrossSlaves, getVariable, getVariable, handleGetRowFrom, handlePutRowTo, identifyErrorOutput, incrementLinesInput, incrementLinesOutput, incrementLinesRead, incrementLinesRejected, incrementLinesSkipped, incrementLinesUpdated, incrementLinesWritten, initBeforeStart, initializeVariablesFrom, injectVariables, isBasic, isDebug, isDetailed, isDistributed, isForcingSeparateLogging, isGatheringMetrics, isInitialising, isMapping, isPartitioned, isPaused, isRowLevel, isRunning, isSafeStopped, isStopped, isUsingThreadPriorityManagment, listVariables, logBasic, logBasic, logDebug, logDebug, logDetailed, logDetailed, logError, logError, logError, logMinimal, logMinimal, logRowlevel, logRowlevel, logSummary, markStart, markStop, openRemoteInputStepSocketsOnce, openRemoteOutputStepSocketsOnce, outputIsDone, pauseRunning, putError, putRow, putRowTo, removeRowListener, resumeRunning, rowsetInputSize, rowsetOutputSize, safeModeChecking, safeModeChecking, setCarteObjectId, setCopy, setCurrentInputRowSetNr, setCurrentOutputRowSetNr, setDistributed, setErrorRowMeta, setErrors, setForcingSeparateLogging, setGatheringMetrics, setInputRowMeta, setInputRowSets, setInternalVariables, setLinesInput, setLinesOutput, setLinesRead, setLinesRejected, setLinesSkipped, setLinesUpdated, setLinesWritten, setLogLevel, setMetaStore, setOutputDone, setOutputRowSets, setParentVariableSpace, setPartitioned, setPartitionID, setPartitionTargets, setPaused, setPaused, setPreviewRowMeta, setRepartitioning, setRepository, setRowHandler, setRunning, setSafeStopped, setServerSockets, setSocketRepository, setStepDataInterface, setStepListeners, setStepMeta, setStepMetaInterface, setStepname, setStopped, setTransMeta, setUsingThreadPriorityManagment, setVariable, shareVariablesWith, stopAll, stopRunning, stopRunning, swapFirstInputRowSetIfExists, toString, verifyInputDeadLock, waitUntilTransformationIsStartedMethods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, waitMethods inherited from interface org.pentaho.di.core.logging.LoggingObjectLifecycleInterface
callAfterLog, callBeforeLogMethods inherited from interface org.pentaho.di.trans.step.StepInterface
addRowListener, addRowSetToInputRowSets, addRowSetToOutputRowSets, addStepListener, afterFinishProcessing, batchComplete, beforeStartProcessing, canProcessOneRow, cleanup, dispose, getCopy, getCurrentInputRowSetNr, getCurrentOutputRowSetNr, getErrors, getInputRowSets, getLinesInput, getLinesOutput, getLinesRead, getLinesRejected, getLinesUpdated, getLinesWritten, getLogChannel, getMetaStore, getOutputRowSets, getPartitionID, getProcessed, getRepository, getResultFiles, getRow, getRowListeners, getRuntime, getStatus, getStepID, getStepMeta, getStepname, getTrans, identifyErrorOutput, initBeforeStart, isMapping, isPartitioned, isPaused, isRunning, isSafeStopped, isStopped, isUsingThreadPriorityManagment, markStart, markStop, pauseRunning, putRow, removeRowListener, resumeRunning, rowsetInputSize, rowsetOutputSize, setCurrentInputRowSetNr, setCurrentOutputRowSetNr, setErrors, setLinesRejected, setMetaStore, setOutputDone, setPartitioned, setPartitionID, setRepartitioning, setRepository, setRunning, setSafeStopped, setStopped, setUsingThreadPriorityManagment, stopAll, stopRunning, subStatusesMethods inherited from interface org.pentaho.di.core.variables.VariableSpace
copyVariablesFrom, environmentSubstitute, environmentSubstitute, environmentSubstitute, fieldSubstitute, getBooleanValueOfVariable, getParentVariableSpace, getVariable, getVariable, initializeVariablesFrom, injectVariables, listVariables, setParentVariableSpace, setVariable, shareVariablesWith
-
Constructor Details
-
MergeJoin
public MergeJoin(StepMeta stepMeta, StepDataInterface stepDataInterface, int copyNr, TransMeta transMeta, Trans trans)
-
-
Method Details
-
processRow
public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws org.pentaho.di.core.exception.KettleException Description copied from interface:StepInterfacePerform the equivalent of processing one row. Typically this means reading a row from input (getRow()) and passing a row to output (putRow)).- Specified by:
processRowin interfaceStepInterface- Overrides:
processRowin classBaseStep- Parameters:
smi- The steps metadata to work withsdi- The steps temporary working data to work with (database connections, result sets, caches, temporary variables, etc.)- Returns:
- false if no more rows can be processed or an error occurred.
- Throws:
org.pentaho.di.core.exception.KettleException
-
init
Description copied from interface:StepInterfaceInitialize and do work where other steps need to wait for...- Specified by:
initin interfaceStepInterface- Overrides:
initin classBaseStep- Parameters:
smi- The metadata to work withsdi- The data to initialize- See Also:
-
isInputLayoutValid
protected boolean isInputLayoutValid(org.pentaho.di.core.row.RowMetaInterface row1, org.pentaho.di.core.row.RowMetaInterface row2) Checks whether incoming rows are join compatible. This essentially means that the keys being compared should be of the same datatype and both rows should have the same number of keys specified- Parameters:
row1- Reference rowrow2- Row to compare to- Returns:
- true when templates are compatible.
-