Class ReservoirSamplingData

java.lang.Object
org.pentaho.di.trans.step.BaseStepData
org.pentaho.di.trans.steps.reservoirsampling.ReservoirSamplingData
All Implemented Interfaces:
StepDataInterface

public class ReservoirSamplingData extends BaseStepData implements StepDataInterface
Holds temporary data (i.e. sampled rows). Implements the reservoir sampling algorithm "R" by Jeffrey Scott Vitter.

For more information see:

Vitter, J. S. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11, No. 1, March 1985. Pages 37-57.

Version:
1.0
Author:
Mark Hall (mhall{[at]}pentaho.org)
  • Field Details

    • m_outputRowMeta

      protected org.pentaho.di.core.row.RowMetaInterface m_outputRowMeta
    • m_sample

      protected List<Object[]> m_sample
    • m_k

      protected int m_k
    • m_currentRow

      protected int m_currentRow
    • m_random

      protected Random m_random
    • m_state

  • Constructor Details

    • ReservoirSamplingData

      public ReservoirSamplingData()
  • Method Details

    • setOutputRowMeta

      public void setOutputRowMeta(org.pentaho.di.core.row.RowMetaInterface rmi)
      Sets the metadata for the output format.
      Parameters:
      rmi - a RowMetaInterface value
    • getOutputRowMeta

      public org.pentaho.di.core.row.RowMetaInterface getOutputRowMeta()
      Gets the output metadata.
      Returns:
      a RowMetaInterface value
    • getSample

      public List<Object[]> getSample()
      Gets the sample as a list of rows, each row being an Object[].
      Returns:
      the sampled rows
    • initialize

      public void initialize(int sampleSize, int seed)
      Initialize this data object
      Parameters:
      sampleSize - the number of rows to sample
      seed - the seed for the random number generator
    • getProcessingMode

      public ReservoirSamplingData.PROC_MODE getProcessingMode()
      Determines the current operational state of the Reservoir Sampling step. The state is one of: Sampling, PassThrough (rows are passed through on the fly rather than held until the end of the input), or Disabled.
      Returns:
      current operational state
    • setProcessingMode

      public void setProcessingMode(ReservoirSamplingData.PROC_MODE state)
      Set this component to sample, pass through or be disabled
      Parameters:
      state - member of PROC_MODE enumeration indicating the desired operational state
    • processRow

      public void processRow(Object[] row)
      Processes an incoming row; this is where the sampling takes place. Sampling is done using the "R" algorithm of Jeffrey Scott Vitter.
      Parameters:
      row - an incoming row
    • cleanUp

      public void cleanUp()
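
Example (illustrative only)

The sampling that processRow is documented to perform can be sketched as follows. This is a minimal stand-alone sketch of reservoir sampling in the style of algorithm "R", not the Pentaho source: the class name ReservoirSketch and its private fields are hypothetical, and only the shapes of initialize(sampleSize, seed), processRow(Object[]) and getSample() are borrowed from the signatures listed above. The first k rows fill the reservoir; after that, each new row replaces a uniformly chosen slot with probability k / (number of rows seen so far, including the new one).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-alone sketch of reservoir sampling; not the Pentaho class. */
class ReservoirSketch {
    private final List<Object[]> sample; // the reservoir (compare m_sample)
    private final int k;                 // requested sample size (compare m_k)
    private final Random random;         // seeded generator (compare m_random)
    private int currentRow = 0;          // number of rows seen so far

    ReservoirSketch(int sampleSize, int seed) {
        this.k = sampleSize;
        this.sample = new ArrayList<>(sampleSize);
        this.random = new Random(seed);
    }

    /** Offers one incoming row to the reservoir. */
    void processRow(Object[] row) {
        if (currentRow < k) {
            // Fill phase: the first k rows are always kept.
            sample.add(row);
        } else {
            // Replacement phase: draw j uniformly from [0, currentRow].
            // The row is kept only if j falls inside the reservoir,
            // which happens with probability k / (currentRow + 1).
            int j = random.nextInt(currentRow + 1);
            if (j < k) {
                sample.set(j, row);
            }
        }
        currentRow++;
    }

    /** Returns the rows currently held in the reservoir. */
    List<Object[]> getSample() {
        return sample;
    }

    public static void main(String[] args) {
        ReservoirSketch data = new ReservoirSketch(3, 42);
        for (int i = 0; i < 100; i++) {
            data.processRow(new Object[] { i });
        }
        // Prints the three sampled values, one per line.
        for (Object[] row : data.getSample()) {
            System.out.println(row[0]);
        }
    }
}
```

With a fixed seed the sketch is repeatable, which is presumably why initialize takes a seed argument. If fewer than k rows arrive, the reservoir simply holds every row seen.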