Class ReservoirSamplingData
java.lang.Object
org.pentaho.di.trans.step.BaseStepData
org.pentaho.di.trans.steps.reservoirsampling.ReservoirSamplingData
- All Implemented Interfaces:
StepDataInterface
Holds temporary data (i.e. sampled rows). Implements the reservoir sampling algorithm "R" by Jeffrey Scott Vitter.
For more information see:
Vitter, J. S. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11, No. 1, March
1985. Pages 37-57.
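As a rough illustration of the algorithm named above, here is a minimal, standalone sketch of Algorithm R. The class name `AlgorithmRSketch`, the method `sample`, and its parameters are hypothetical and are not part of the Pentaho API. The sketch assumes `k` is non-negative and that the input is read in a single pass. The first k items fill the reservoir; after that, the n-th item replaces a random slot with probability k/n, so every item ends up in the sample with equal probability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not the Pentaho implementation.
class AlgorithmRSketch {

    /** Draws a uniform random sample of up to k items from a stream in one pass. */
    static <T> List<T> sample(Iterable<T> stream, int k, long seed) {
        Random random = new Random(seed);
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0; // number of items read so far

        for (T item : stream) {
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k items always enter the reservoir.
                reservoir.add(item);
            } else {
                // Replacement phase: draw j uniformly from [0, seen).
                // The item is kept only if j lands inside the reservoir,
                // which happens with probability k / seen.
                long j = (long) (random.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }
}
```

If the stream holds fewer than k items, the sketch simply returns all of them. Memory use is bounded by k regardless of the stream length.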
- Version:
- 1.0
- Author:
- Mark Hall (mhall{[at]}pentaho.org)
-
Nested Class Summary
Nested classes/interfaces inherited from class org.pentaho.di.trans.step.BaseStepData
BaseStepData.StepExecutionStatus
-
Field Summary
Modifier and Type   Field   Description
protected int m_currentRow
protected int m_k
protected org.pentaho.di.core.row.RowMetaInterface m_outputRowMeta
protected Random m_random
protected ReservoirSamplingData.PROC_MODE m_state
-
Constructor Summary
-
Method Summary
Modifier and Type   Method   Description
void cleanUp()
org.pentaho.di.core.row.RowMetaInterface getOutputRowMeta() - Get the output meta data
getProcessingMode() - Determine the current operational state of the Reservoir Sampling step.
getSample() - Gets the sample as an array of rows
void initialize(int sampleSize, int seed) - Initialize this data object
void processRow(Object[] row) - Here is where the action happens.
void setOutputRowMeta(org.pentaho.di.core.row.RowMetaInterface rmi) - Set the meta data for the output format
void setProcessingMode(ReservoirSamplingData.PROC_MODE state) - Set this component to sample, pass through or be disabled
Methods inherited from class org.pentaho.di.trans.step.BaseStepData
getStatus, isDisposed, isEmpty, isFinished, isIdle, isInitialising, isRunning, isStopped, setStatus
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.pentaho.di.trans.step.StepDataInterface
getStatus, isDisposed, isEmpty, isFinished, isIdle, isInitialising, isRunning, setStatus
-
Field Details
-
m_outputRowMeta
protected org.pentaho.di.core.row.RowMetaInterface m_outputRowMeta -
m_sample
-
m_k
protected int m_k -
m_currentRow
protected int m_currentRow -
m_random
protected Random m_random -
m_state
protected ReservoirSamplingData.PROC_MODE m_state -
-
Constructor Details
-
ReservoirSamplingData
public ReservoirSamplingData()
-
-
Method Details
-
setOutputRowMeta
public void setOutputRowMeta(org.pentaho.di.core.row.RowMetaInterface rmi)
Set the meta data for the output format
- Parameters:
rmi - a RowMetaInterface value
-
getOutputRowMeta
public org.pentaho.di.core.row.RowMetaInterface getOutputRowMeta()
Get the output meta data
- Returns:
- a RowMetaInterface value
-
getSample
Gets the sample as an array of rows
- Returns:
- the sampled rows
-
initialize
public void initialize(int sampleSize, int seed)
Initialize this data object
- Parameters:
sampleSize - the number of rows to sample
seed - the seed for the random number generator
-
getProcessingMode
Determine the current operational state of the Reservoir Sampling step. The state is one of: Sampling, PassThrough (do not wait until the end; pass rows through on the fly), or Disabled.
- Returns:
- current operational state
-
setProcessingMode
Set this component to sample, pass through or be disabled
- Parameters:
state - member of PROC_MODE enumeration indicating the desired operational state
-
processRow
Here is where the action happens. Sampling is done using the "R" algorithm of Jeffrey Scott Vitter.
- Parameters:
row - an incoming row
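To show how a per-row method like this could maintain the reservoir between calls, here is a hedged sketch of a stand-in class. `ReservoirStateSketch` is hypothetical: it borrows the documented method names (`initialize`, `processRow`, `getSample`), but its fields, the `List<Object[]>` return type, and its internals are assumptions and not the Pentaho source. It keeps a sample size, a row counter, and a seeded `Random`, and applies one Algorithm R step per incoming row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in that mirrors the documented method names.
// Its internals are assumptions, not the actual Pentaho code.
class ReservoirStateSketch {
    private List<Object[]> sample; // the reservoir (assumed representation)
    private int k;                 // requested sample size
    private int currentRow;        // rows seen so far (an int counter limits stream length)
    private Random random;

    void initialize(int sampleSize, int seed) {
        k = sampleSize;
        sample = new ArrayList<>(k);
        currentRow = 0;
        random = new Random(seed); // a fixed seed makes the sample reproducible
    }

    void processRow(Object[] row) {
        if (currentRow < k) {
            // The first k rows fill the reservoir.
            sample.add(row);
        } else {
            // Draw a slot in [0, currentRow]; replace only if it is inside the reservoir.
            int j = random.nextInt(currentRow + 1);
            if (j < k) {
                sample.set(j, row);
            }
        }
        currentRow++;
    }

    List<Object[]> getSample() {
        return sample;
    }
}
```

In this sketch, two instances initialized with the same sample size and seed and fed the same rows in the same order hold identical samples, which is the practical reason for exposing the seed as a parameter.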
-
cleanUp
public void cleanUp()
-