Class ReservoirSamplingData

  • All Implemented Interfaces:
    StepDataInterface

    public class ReservoirSamplingData
    extends BaseStepData
    implements StepDataInterface
    Holds temporary data (i.e. the sampled rows) for the Reservoir Sampling step. Implements the reservoir sampling algorithm "R" as described by Jeffrey Scott Vitter.

    For more information see:

    Vitter, J. S. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11, No. 1, March 1985. Pages 37-57.

    Version:
    1.0
    Author:
    Mark Hall (mhall{[at]}pentaho.org)
    • Field Detail

      • m_outputRowMeta

        protected org.pentaho.di.core.row.RowMetaInterface m_outputRowMeta
      • m_k

        protected int m_k
      • m_currentRow

        protected int m_currentRow
      • m_random

        protected Random m_random
    • Constructor Detail

      • ReservoirSamplingData

        public ReservoirSamplingData()
    • Method Detail

      • setOutputRowMeta

        public void setOutputRowMeta(org.pentaho.di.core.row.RowMetaInterface rmi)
        Sets the row metadata for the output format.
        Parameters:
        rmi - a RowMetaInterface value
      • getOutputRowMeta

        public org.pentaho.di.core.row.RowMetaInterface getOutputRowMeta()
        Gets the output row metadata.
        Returns:
        a RowMetaInterface value
      • getSample

        public List<Object[]> getSample()
        Gets the sample as a list of rows.
        Returns:
        the sampled rows
      • initialize

        public void initialize(int sampleSize,
                               int seed)
        Initializes this data object.
        Parameters:
        sampleSize - the number of rows to sample
        seed - the seed for the random number generator
      • getProcessingMode

        public ReservoirSamplingData.PROC_MODE getProcessingMode()
        Gets the current operational state of the Reservoir Sampling step: one of Sampling, PassThrough (rows are passed through on the fly instead of being held until the end of the input) or Disabled.
        Returns:
        current operational state
      • setProcessingMode

        public void setProcessingMode(ReservoirSamplingData.PROC_MODE state)
        Sets this component to sample, pass rows through, or be disabled.
        Parameters:
        state - member of PROC_MODE enumeration indicating the desired operational state
      • processRow

        public void processRow(Object[] row)
        Processes one incoming row. Sampling is done using the "R" algorithm described by Jeffrey Scott Vitter.
        Parameters:
        row - an incoming row
      • cleanUp

        public void cleanUp()
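
As a rough illustration of what processRow does, the following is a minimal stand-alone sketch of reservoir sampling with Algorithm R. It is not the Pentaho source: the class name ReservoirSketch is hypothetical, its fields only mirror the apparent roles of m_k, m_currentRow and m_random (an assumption based on their names), and the processing modes are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-alone sketch of Algorithm R; not the Pentaho class. */
class ReservoirSketch {
    private final int k;                  // sample size (assumed role of m_k)
    private final Random random;          // seeded generator (assumed role of m_random)
    private final List<Object[]> sample;  // the reservoir
    private int currentRow = 0;           // rows seen so far (assumed role of m_currentRow)

    /** sampleSize must be non-negative. */
    public ReservoirSketch(int sampleSize, int seed) {
        this.k = sampleSize;
        this.random = new Random(seed);
        this.sample = new ArrayList<>(sampleSize);
    }

    public void processRow(Object[] row) {
        if (currentRow < k) {
            // Fill phase: the first k rows always enter the reservoir.
            sample.add(row);
        } else {
            // Replacement phase: draw j uniformly from [0, currentRow];
            // the new row replaces slot j only when j falls inside the reservoir,
            // which happens with probability k / (currentRow + 1).
            int j = random.nextInt(currentRow + 1);
            if (j < k) {
                sample.set(j, row);
            }
        }
        currentRow++;
    }

    public List<Object[]> getSample() {
        return sample;
    }
}
```

In this sketch, if fewer than sampleSize rows are processed, getSample() simply returns every row seen so far, in arrival order. Using the same seed on the same input reproduces the same sample.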