apache_beam.runners.interactive.caching.streaming_cache module
- class apache_beam.runners.interactive.caching.streaming_cache.StreamingCacheSink(cache_dir, filename, sample_resolution_sec, coder=SafeFastPrimitivesCoder)[source]
Bases:
PTransform
A PTransform that writes TestStreamFile(Header|Records)s to file.
This transform takes in an arbitrary element stream and writes the list of TestStream events (as TestStreamFileRecords) to file. When replayed, this will produce the best-effort replay of the original job (e.g. some elements may be produced slightly out of order from the original stream).
Note that this PTransform is assumed to be only run on a single machine where the following assumptions are correct: elements come in ordered, no two transforms are writing to the same file. This PTransform is assumed to only run correctly with the DirectRunner.
TODO(https://github.com/apache/beam/issues/20002): Generalize this to more source/sink types aside from file based. Also, generalize to cases where there might be multiple workers writing to the same sink.
- property path
Returns the path the sink leads to.
- property size_in_bytes
Returns the space usage in bytes of the sink.
- class apache_beam.runners.interactive.caching.streaming_cache.StreamingCacheSource(cache_dir, labels, is_cache_complete=None, coder=None)[source]
Bases:
object
A class that reads and parses TestStreamFile(Header|Reader)s.
This source operates in the following way:
Wait for up to timeout_secs for the file to be available.
Read, parse, and emit the entire contents of the file
Wait for more events to come or until is_cache_complete returns True
If there are more events, then go to 2
Otherwise, stop emitting.
This class is used to read from file and send its to the TestStream via the StreamingCacheManager.Reader.
- class apache_beam.runners.interactive.caching.streaming_cache.StreamingCache(cache_dir, is_cache_complete=None, sample_resolution_sec=0.1, saved_pcoders=None)[source]
Bases:
CacheManager
Abstraction that holds the logic for reading and writing to cache.
- property capture_size
- property capture_paths
- property capture_keys
- read_multiple(labels, tail=True)[source]
Returns a generator to read all records from file.
Does tail until the cache is complete. This is because it is used in the TestStreamServiceController to read from file which is only used during pipeline runtime which needs to block.
- source(*labels)[source]
Returns the StreamingCacheManager source.
This is beam.Impulse() because unbounded sources will be marked with this and then the PipelineInstrument will replace these with a TestStream.
- sink(labels, is_capture=False)[source]
Returns a StreamingCacheSink to write elements to file.
Note that this is assumed to only work in the DirectRunner as the underlying StreamingCacheSink assumes a single machine to have correct element ordering.
- class Reader(headers, readers)[source]
Bases:
object
Abstraction that reads from PCollection readers.
This class is an Abstraction layer over multiple PCollection readers to be used for supplying a TestStream service with events.
This class is also responsible for holding the state of the clock, injecting clock advancement events, and watermark advancement events.