org.pentaho.di.trans.steps.reservoirsampling
Class ReservoirSamplingData

java.lang.Object
  extended by org.pentaho.di.trans.step.BaseStepData
      extended by org.pentaho.di.trans.steps.reservoirsampling.ReservoirSamplingData
All Implemented Interfaces:
StepDataInterface

public class ReservoirSamplingData
extends BaseStepData
implements StepDataInterface

Holds temporary data (i.e. sampled rows). Implements the reservoir sampling algorithm "R" by Jeffrey Scott Vitter.

For more information see:

Vitter, J. S. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11, No. 1, March 1985. Pages 37-57.

Version:
1.0
Author:
Mark Hall (mhall{[at]}pentaho.org)

Nested Class Summary
static class ReservoirSamplingData.PROC_MODE
           
 
Nested classes/interfaces inherited from class org.pentaho.di.trans.step.BaseStepData
BaseStepData.StepExecutionStatus
 
Constructor Summary
ReservoirSamplingData()
           
 
Method Summary
 void cleanUp()
           
 RowMetaInterface getOutputRowMeta()
          Get the output metadata
 ReservoirSamplingData.PROC_MODE getProcessingMode()
          Determine the current operational state of the Reservoir Sampling step.
 List<Object[]> getSample()
          Gets the sample as a list of rows
 void initialize(int sampleSize, int seed)
          Initialize this data object
 void processRow(Object[] row)
          Processes one incoming row and updates the sample.
 void setOutputRowMeta(RowMetaInterface rmi)
          Set the metadata for the output format
 void setProcessingMode(ReservoirSamplingData.PROC_MODE state)
          Set this component to sample, pass through or be disabled
 
Methods inherited from class org.pentaho.di.trans.step.BaseStepData
getStatus, isDisposed, isEmpty, isFinished, isIdle, isInitialising, isRunning, isStopped, setStatus
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.pentaho.di.trans.step.StepDataInterface
getStatus, isDisposed, isEmpty, isFinished, isIdle, isInitialising, isRunning, setStatus
 

Constructor Detail

ReservoirSamplingData

public ReservoirSamplingData()
Method Detail

setOutputRowMeta

public void setOutputRowMeta(RowMetaInterface rmi)
Set the metadata for the output format

Parameters:
rmi - a RowMetaInterface value

getOutputRowMeta

public RowMetaInterface getOutputRowMeta()
Get the output metadata

Returns:
a RowMetaInterface value

getSample

public List<Object[]> getSample()
Gets the sample as a list of rows

Returns:
the sampled rows

initialize

public void initialize(int sampleSize,
                       int seed)
Initialize this data object

Parameters:
sampleSize - the number of rows to sample
seed - the seed for the random number generator

getProcessingMode

public ReservoirSamplingData.PROC_MODE getProcessingMode()
Determine the current operational state of the Reservoir Sampling step. The state is one of Sampling, PassThrough (rows are passed through on the fly rather than held until the end of the input), or Disabled.

Returns:
current operational state

setProcessingMode

public void setProcessingMode(ReservoirSamplingData.PROC_MODE state)
Set this component to sample, pass through or be disabled

Parameters:
state - member of PROC_MODE enumeration indicating the desired operational state

processRow

public void processRow(Object[] row)
Processes one incoming row and updates the sample. Sampling is done using the "R" algorithm of Jeffrey Scott Vitter.

Parameters:
row - an incoming row
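
For orientation, Algorithm R as commonly described can be sketched as follows. This is a stand-alone illustration, not the source of this class: the class name ReservoirSketch and its fields are hypothetical, and only the method names mirror the signatures documented on this page. The first sampleSize rows fill the reservoir; after that, the n-th row replaces a uniformly chosen reservoir entry with probability sampleSize / n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-alone sketch of Algorithm R; not the Pentaho implementation. */
public class ReservoirSketch {
    private final int sampleSize;
    private final Random random;
    private final List<Object[]> sample;
    private long rowsSeen = 0;

    public ReservoirSketch(int sampleSize, int seed) {
        if (sampleSize < 0) {
            throw new IllegalArgumentException("sampleSize must be >= 0");
        }
        this.sampleSize = sampleSize;
        this.random = new Random(seed);
        this.sample = new ArrayList<>(sampleSize);
    }

    /** Offers one row to the reservoir. */
    public void processRow(Object[] row) {
        rowsSeen++;
        if (sample.size() < sampleSize) {
            // Reservoir not yet full: keep every row.
            sample.add(row);
            return;
        }
        // Draw j uniformly from [0, rowsSeen). The row is kept only when j
        // falls inside the reservoir, which happens with probability
        // sampleSize / rowsSeen; it then overwrites slot j.
        long j = (long) (random.nextDouble() * rowsSeen);
        if (j < sampleSize) {
            sample.set((int) j, row);
        }
    }

    public List<Object[]> getSample() {
        return sample;
    }

    public static void main(String[] args) {
        ReservoirSketch reservoir = new ReservoirSketch(3, 42);
        for (int i = 0; i < 1000; i++) {
            reservoir.processRow(new Object[] { i });
        }
        // The reservoir never grows beyond sampleSize rows.
        System.out.println(reservoir.getSample().size());
    }
}
```

Because the random number generator is seeded, two runs with the same seed and the same input order select the same rows in this sketch.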

cleanUp

public void cleanUp()