Class ReservoirSamplingData
java.lang.Object
org.pentaho.di.trans.step.BaseStepData
org.pentaho.di.trans.steps.reservoirsampling.ReservoirSamplingData
- All Implemented Interfaces:
StepDataInterface
Holds temporary data (i.e. sampled rows). Implements the reservoir sampling algorithm "R" by Jeffrey Scott Vitter.
For more information see:
Vitter, J. S. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11, No. 1, March
1985. Pages 37-57.
- Version:
- 1.0
- Author:
- Mark Hall (mhall{[at]}pentaho.org)
-
Nested Class Summary
Nested Classes
Nested classes/interfaces inherited from class org.pentaho.di.trans.step.BaseStepData
BaseStepData.StepExecutionStatus
-
Field Summary
Fields
Modifier and Type    Field    Description
protected int m_currentRow
protected int m_k
protected org.pentaho.di.core.row.RowMetaInterface m_outputRowMeta
protected Random m_random
m_sample
protected ReservoirSamplingData.PROC_MODE m_state
-
Constructor Summary
Constructors
ReservoirSamplingData()
-
Method Summary
Modifier and Type    Method    Description
void cleanUp()
org.pentaho.di.core.row.RowMetaInterface getOutputRowMeta(): Get the output meta data
getProcessingMode: Determine the current operational state of the Reservoir Sampling step.
getSample: Gets the sample as an array of rows
void initialize(int sampleSize, int seed): Initialize this data object
void processRow(Object[] row): Here is where the action happens.
void setOutputRowMeta(org.pentaho.di.core.row.RowMetaInterface rmi): Set the meta data for the output format
void setProcessingMode: Set this component to sample, pass through or be disabled
Methods inherited from class org.pentaho.di.trans.step.BaseStepData
getStatus, isDisposed, isEmpty, isFinished, isIdle, isInitialising, isRunning, isStopped, setStatus
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.pentaho.di.trans.step.StepDataInterface
getStatus, isDisposed, isEmpty, isFinished, isIdle, isInitialising, isRunning, setStatus
-
Field Details
-
m_outputRowMeta
protected org.pentaho.di.core.row.RowMetaInterface m_outputRowMeta
-
m_sample
-
m_k
protected int m_k
-
m_currentRow
protected int m_currentRow
-
m_random
-
m_state
-
-
Constructor Details
-
ReservoirSamplingData
public ReservoirSamplingData()
-
-
Method Details
-
setOutputRowMeta
public void setOutputRowMeta(org.pentaho.di.core.row.RowMetaInterface rmi)
Set the meta data for the output format.
- Parameters:
rmi - a RowMetaInterface value
-
getOutputRowMeta
public org.pentaho.di.core.row.RowMetaInterface getOutputRowMeta()
Get the output meta data.
- Returns:
- a RowMetaInterface value
-
getSample
Gets the sample as an array of rows.
- Returns:
- the sampled rows
-
initialize
public void initialize(int sampleSize, int seed)
Initialize this data object.
- Parameters:
sampleSize - the number of rows to sample
seed - the seed for the random number generator
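The seed parameter matters because a seeded generator is deterministic. The sketch below is not taken from this class; it only illustrates, with java.util.Random, that two generators created with the same seed yield the same sequence of draws, which is what makes a sampling run repeatable. The class name SeedDemo and the helper draws are hypothetical, and it is an assumption that the step seeds its generator in a comparable way.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration class; not part of the Pentaho API.
class SeedDemo {

    // Draw n indices in [0, bound) from a generator seeded with the given seed.
    static int[] draws(int seed, int n, int bound) {
        Random random = new Random(seed);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = random.nextInt(bound);
        }
        return out;
    }

    public static void main(String[] args) {
        // Same seed, same sequence: prints true.
        System.out.println(Arrays.equals(draws(7, 5, 100), draws(7, 5, 100)));
    }
}
```

Using a fixed seed is handy for testing a transformation; varying the seed between runs gives a different sample each time.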
-
getProcessingMode
Determine the current operational state of the Reservoir Sampling step. The state is one of Sampling, PassThrough (do not wait until the end; pass rows through on the fly), or Disabled.
- Returns:
- current operational state
-
setProcessingMode
Set this component to sample, pass through or be disabled.
- Parameters:
state - member of PROC_MODE enumeration indicating the desired operational state
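As a rough illustration of how a step might branch on the three operational states, here is a self-contained sketch. It does not use the real PROC_MODE enumeration: the enum Mode, its constant names, and the class ModeSketch are all hypothetical. The behaviour of the disabled state (rows are ignored) is an assumption, since the documentation above does not describe it, and the sampling branch simply holds rows until end of input without applying any reservoir replacement.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a step that switches on its processing mode.
class ModeSketch {

    // Assumed constant names; the real PROC_MODE members are not listed above.
    enum Mode { SAMPLING, PASS_THROUGH, DISABLED }

    private Mode mode = Mode.SAMPLING;
    private final List<String> held = new ArrayList<>();
    private final List<String> emitted = new ArrayList<>();

    void setMode(Mode mode) { this.mode = mode; }

    Mode getMode() { return mode; }

    // Handle one incoming row according to the current mode.
    void onRow(String row) {
        switch (mode) {
            case SAMPLING:
                held.add(row);      // held back until all input has been seen
                break;
            case PASS_THROUGH:
                emitted.add(row);   // forwarded on the fly
                break;
            case DISABLED:
                break;              // assumption: the row is ignored
        }
    }

    // Called once after the last row; only sampling mode has anything to flush.
    void onEndOfInput() {
        if (mode == Mode.SAMPLING) {
            emitted.addAll(held);
            held.clear();
        }
    }

    List<String> emitted() { return emitted; }
}
```

The point of the sketch is the timing difference: pass-through output is available immediately, while sampling output can only be produced once the input is exhausted.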
-
processRow
Processes one incoming row; this is where the sampling takes place. Sampling is done using the "R" algorithm of Jeffrey Scott Vitter.
- Parameters:
row - an incoming row
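For readers unfamiliar with Algorithm R, a minimal self-contained sketch follows. It is not the source of this class: the class name ReservoirSketch and its fields are hypothetical, chosen to loosely mirror the documented members (a sample size, a row counter, a seeded random generator). The first k rows fill the reservoir; after that, the t-th row replaces a uniformly chosen reservoir slot with probability k/t.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of Algorithm R; not the Pentaho implementation.
class ReservoirSketch {

    private final List<Object[]> sample;  // the reservoir
    private final int k;                  // requested sample size
    private int seen = 0;                 // rows processed so far
    private final Random random;

    ReservoirSketch(int k, int seed) {
        if (k < 0) {
            throw new IllegalArgumentException("sample size must be >= 0");
        }
        this.k = k;
        this.sample = new ArrayList<>(k);
        this.random = new Random(seed);
    }

    // Offer one row to the reservoir.
    void processRow(Object[] row) {
        seen++;
        if (sample.size() < k) {
            sample.add(row);               // still filling the reservoir
        } else {
            int j = random.nextInt(seen);  // uniform in [0, seen)
            if (j < k) {
                sample.set(j, row);        // happens with probability k / seen
            }
        }
    }

    List<Object[]> getSample() { return sample; }

    public static void main(String[] args) {
        ReservoirSketch r = new ReservoirSketch(3, 42);
        for (int i = 0; i < 1000; i++) {
            r.processRow(new Object[] { i });
        }
        System.out.println(r.getSample().size()); // prints 3
    }
}
```

Because only k rows are ever held, memory use is independent of the input length, and the total number of rows does not need to be known in advance.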
-
cleanUp
public void cleanUp()
-