Apache Beam SDK for Python
Apache Beam provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.
The Apache Beam SDK for Python provides access to Apache Beam capabilities from the Python programming language.
Overview
The key concepts in this programming model are
- PCollection: represents a collection of data, which could be bounded or unbounded in size.
- PTransform: represents a computation that transforms input PCollections into output PCollections.
- Pipeline: manages a directed acyclic graph of- PTransforms and- PCollections that is ready for execution.
- PipelineRunner: specifies where and how the pipeline should execute.
- Read: read from an external source.
- Write: write to an external data sink.
Typical usage
At the top of your source file:
import apache_beam as beam
After this import statement
- Transform classes are available as - beam.FlatMap,- beam.GroupByKey, etc.
- Pipeline class is available as - beam.Pipeline
- Text read/write transforms are available as - beam.io.ReadFromText,- beam.io.WriteToText.
Examples
The examples subdirectory has some examples.
Subpackages
- apache_beam.coders package- Submodules- apache_beam.coders.avro_record module
- apache_beam.coders.coders module- Coder
- AvroGenericCoder
- BooleanCoder
- BytesCoder
- CloudpickleCoder
- DillCoder
- FastPrimitivesCoder
- FloatCoder
- IterableCoder
- ListCoder
- MapCoder
- NullableCoder
- PickleCoder
- ProtoCoder
- ProtoPlusCoder
- ShardedKeyCoder
- SinglePrecisionFloatCoder
- SingletonCoder
- StrUtf8Coder
- TimestampCoder
- TupleCoder
- TupleSequenceCoder
- VarIntCoder
- WindowedValueCoder
- ParamWindowedValueCoder
- BigIntegerCoder
- DecimalCoder
- PaneInfoCoder
 
- apache_beam.coders.observable module
- apache_beam.coders.row_coder module
- apache_beam.coders.slow_stream module
- apache_beam.coders.typecoders module
 
 
- Submodules
- apache_beam.dataframe package- Submodules- apache_beam.dataframe.convert module
- apache_beam.dataframe.doctests module
- apache_beam.dataframe.expressions module
- apache_beam.dataframe.frame_base module
- apache_beam.dataframe.frames module
- apache_beam.dataframe.io module
- apache_beam.dataframe.pandas_top_level_functions module
- apache_beam.dataframe.partitionings module
- apache_beam.dataframe.schemas module
- apache_beam.dataframe.transforms module
 
 
- Submodules
- apache_beam.io package- Subpackages
- Submodules- apache_beam.io.avroio module
- apache_beam.io.concat_source module
- apache_beam.io.debezium module
- apache_beam.io.filebasedsink module
- apache_beam.io.filebasedsource module
- apache_beam.io.fileio module
- apache_beam.io.filesystem module
- apache_beam.io.filesystemio module
- apache_beam.io.filesystems module
- apache_beam.io.hadoopfilesystem module
- apache_beam.io.iobase module
- apache_beam.io.jdbc module
- apache_beam.io.kafka module
- apache_beam.io.kinesis module
- apache_beam.io.localfilesystem module
- apache_beam.io.mongodbio module
- apache_beam.io.parquetio module
- apache_beam.io.range_trackers module
- apache_beam.io.requestresponse module
- apache_beam.io.restriction_trackers module
- apache_beam.io.snowflake module
- apache_beam.io.source_test_utils module
- apache_beam.io.textio module
- apache_beam.io.tfrecordio module
- apache_beam.io.utils module
- apache_beam.io.watermark_estimators module
 
 
- apache_beam.metrics package- Submodules- apache_beam.metrics.cells module
- apache_beam.metrics.metric module
- apache_beam.metrics.metricbase module
- apache_beam.metrics.monitoring_infos module- extract_counter_value()
- extract_gauge_value()
- extract_distribution()
- extract_string_set_value()
- extract_bounded_trie_value()
- create_labels()
- int64_user_counter()
- int64_counter()
- int64_user_distribution()
- int64_distribution()
- int64_user_gauge()
- int64_gauge()
- user_set_string()
- user_bounded_trie()
- create_monitoring_info()
- is_counter()
- is_gauge()
- is_distribution()
- is_string_set()
- is_bounded_trie()
- is_user_monitoring_info()
- extract_metric_result_map_value()
- parse_namespace_and_name()
- get_step_name()
- to_key()
- sum_payload_combiner()
- distribution_payload_combiner()
- consolidate()
 
 
 
- Submodules
- apache_beam.ml package
- apache_beam.options package
- apache_beam.portability package
- apache_beam.runners package
- apache_beam.testing package- Subpackages
- Submodules- apache_beam.testing.datatype_inference module
- apache_beam.testing.extra_assertions module
- apache_beam.testing.metric_result_matchers module
- apache_beam.testing.pipeline_verifiers module
- apache_beam.testing.synthetic_pipeline module- get_generator()
- parse_byte_size()
- div_round_up()
- rotate_key()
- initial_splitting_zipf()
- SyntheticStep
- NonLiquidShardingOffsetRangeTracker
- SyntheticSDFStepRestrictionProvider
- get_synthetic_sdf_step()
- SyntheticSource
- SyntheticSDFSourceRestrictionProvider
- SyntheticSDFAsSource
- ShuffleBarrier
- SideInputBarrier
- merge_using_gbk()
- merge_using_side_input()
- expand_using_gbk()
- expand_using_second_output()
- parse_args()
- run()
- StatefulLoadGenerator
 
- apache_beam.testing.test_pipeline module
- apache_beam.testing.test_stream module
- apache_beam.testing.test_stream_service module
- apache_beam.testing.test_utils module
- apache_beam.testing.util module
 
 
- apache_beam.transforms package- Subpackages
- Submodules- apache_beam.transforms.async_dofn module
- apache_beam.transforms.combinefn_lifecycle_pipeline module
- apache_beam.transforms.combiners module
- apache_beam.transforms.core module
- apache_beam.transforms.create_source module
- apache_beam.transforms.deduplicate module
- apache_beam.transforms.display module
- apache_beam.transforms.enrichment module
- apache_beam.transforms.environments module
- apache_beam.transforms.error_handling module
- apache_beam.transforms.external module- convert_to_typing_type()
- iter_urns()
- PayloadBuilder
- SchemaBasedPayloadBuilder
- ImplicitSchemaPayloadBuilder
- NamedTupleBasedPayloadBuilder
- SchemaTransformPayloadBuilder
- ExplicitSchemaTransformPayloadBuilder
- JavaClassLookupPayloadBuilder
- SchemaTransformsConfig
- ManagedReplacement
- SchemaAwareExternalTransform
- JavaExternalTransform
- AnnotationBasedPayloadBuilder
- DataclassBasedPayloadBuilder
- ExternalTransform
- ExpansionAndArtifactRetrievalStub
- JavaJarExpansionService
- BeamJarExpansionService
- memoize()
 
- apache_beam.transforms.external_java module
- apache_beam.transforms.external_transform_provider module
- apache_beam.transforms.fully_qualified_named_transform module
- apache_beam.transforms.managed module
- apache_beam.transforms.periodicsequence module
- apache_beam.transforms.ptransform module
- apache_beam.transforms.resources module
- apache_beam.transforms.sideinputs module
- apache_beam.transforms.sql module
- apache_beam.transforms.stats module
- apache_beam.transforms.timeutil module
- apache_beam.transforms.trigger module
- apache_beam.transforms.userstate module- StateSpec
- ReadModifyWriteStateSpec
- BagStateSpec
- SetStateSpec
- CombiningValueStateSpec
- OrderedListStateSpec
- Timer
- TimerSpec
- on_timer()
- get_dofn_specs()
- is_stateful_dofn()
- validate_stateful_dofn()
- BaseTimer
- RuntimeTimer
- RuntimeState
- ReadModifyWriteRuntimeState
- AccumulatingRuntimeState
- BagRuntimeState
- SetRuntimeState
- CombiningValueRuntimeState
- OrderedListRuntimeState
- UserStateContext
 
- apache_beam.transforms.util module
- apache_beam.transforms.window module
 
 
- apache_beam.typehints package- Subpackages
- Submodules- apache_beam.typehints.arrow_batching_microbenchmark module
- apache_beam.typehints.arrow_type_compatibility module
- apache_beam.typehints.batch module
- apache_beam.typehints.decorators module
- apache_beam.typehints.intrinsic_one_ops module
- apache_beam.typehints.native_type_compatibility module- match_is_named_tuple()
- extract_optional_type()
- is_any()
- is_new_type()
- is_forward_ref()
- convert_builtin_to_typing()
- convert_typing_to_builtin()
- convert_collections_to_typing()
- is_builtin()
- TypedWindowedValue
- convert_to_beam_type()
- convert_to_beam_types()
- convert_to_python_type()
- convert_to_python_types()
 
- apache_beam.typehints.opcodes module- pop_one()
- pop_two()
- pop_three()
- push_value()
- nop()
- resume()
- pop_top()
- end_for()
- end_send()
- copy()
- rot_n()
- rot_two()
- rot_three()
- rot_four()
- dup_top()
- unary()
- unary_positive()
- unary_negative()
- unary_invert()
- unary_not()
- unary_convert()
- get_iter()
- symmetric_binary_op()
- binary_power()
- inplace_power()
- binary_multiply()
- inplace_multiply()
- binary_divide()
- inplace_divide()
- binary_floor_divide()
- inplace_floor_divide()
- binary_true_divide()
- inplace_true_divide()
- binary_modulo()
- inplace_modulo()
- binary_add()
- inplace_add()
- binary_subtract()
- inplace_subtract()
- binary_subscr()
- binary_lshift()
- inplace_lshift()
- binary_rshift()
- inplace_rshift()
- binary_and()
- inplace_and()
- binary_xor()
- inplace_xor()
- binary_or()
- inplace_or()
- binary_op()
- store_subscr()
- binary_slice()
- store_slice()
- print_item()
- print_newline()
- list_append()
- set_add()
- map_add()
- load_locals()
- exec_stmt()
- build_class()
- unpack_sequence()
- dup_topx()
- store_attr()
- delete_attr()
- store_global()
- delete_global()
- load_const()
- load_name()
- build_tuple()
- build_list()
- build_set()
- build_map()
- build_const_key_map()
- list_to_tuple()
- build_string()
- list_extend()
- set_update()
- dict_update()
- dict_merge()
- load_attr()
- load_method()
- compare_op()
- is_op()
- contains_op()
- import_name()
- import_from()
- load_global()
- store_map()
- load_fast()
- load_fast_load_fast()
- load_fast_check()
- load_fast_and_clear()
- store_fast()
- store_fast_store_fast()
- store_fast_load_fast()
- delete_fast()
- swap()
- reraise()
- gen_start()
- load_closure()
- load_deref()
- make_function()
- set_function_attribute()
- make_closure()
- build_slice()
- to_bool()
- format_value()
- convert_value()
- format_simple()
- format_with_spec()
- build_list_unpack()
- build_set_unpack()
- build_tuple_unpack()
- build_tuple_unpack_with_call()
- build_map_unpack()
 
- apache_beam.typehints.pandas_type_compatibility module
- apache_beam.typehints.pytorch_type_compatibility module
- apache_beam.typehints.row_type module
- apache_beam.typehints.schema_registry module
- apache_beam.typehints.schemas module- named_fields_to_schema()
- named_fields_from_schema()
- typing_to_runner_api()
- typing_from_runner_api()
- value_to_runner_api()
- value_from_runner_api()
- option_to_runner_api()
- option_from_runner_api()
- schema_field()
- SchemaTranslation
- named_tuple_from_schema()
- named_tuple_to_schema()
- schema_from_element_type()
- named_fields_from_element_type()
- union_schema_type()
- LogicalTypeRegistry
- LogicalType
- NoArgumentLogicalType
- PassThroughLogicalType
- MicrosInstantRepresentation
- MillisInstant
- MicrosInstant
- PythonCallable
- FixedPrecisionDecimalArgumentRepresentation
- DecimalLogicalType
- FixedPrecisionDecimalLogicalType
- FixedBytes
- VariableBytes
- FixedString
- VariableString
- JdbcDateType
- JdbcTimeType
 
- apache_beam.typehints.sharded_key_type module
- apache_beam.typehints.trivial_inference module
- apache_beam.typehints.typecheck module
- apache_beam.typehints.typehints module
 
 
- apache_beam.utils package- Submodules- apache_beam.utils.annotations module
- apache_beam.utils.histogram module
- apache_beam.utils.interactive_utils module
- apache_beam.utils.multi_process_shared module
- apache_beam.utils.plugin module
- apache_beam.utils.processes module
- apache_beam.utils.profiler module
- apache_beam.utils.proto_utils module
- apache_beam.utils.python_callable module
- apache_beam.utils.retry module- PermanentException
- FuzzedExponentialIntervals
- retry_on_server_errors_filter()
- retry_on_server_errors_and_notfound_filter()
- retry_on_server_errors_and_timeout_filter()
- retry_on_server_errors_timeout_or_quota_issues_filter()
- retry_on_beam_io_error_filter()
- retry_if_valid_input_but_server_error_and_timeout_filter()
- Clock
- no_retries()
- with_exponential_backoff()
 
- apache_beam.utils.sentinel module
- apache_beam.utils.sharded_key module
- apache_beam.utils.shared module
- apache_beam.utils.subprocess_server module
- apache_beam.utils.thread_pool_executor module
- apache_beam.utils.timestamp module
- apache_beam.utils.transform_service_launcher module
- apache_beam.utils.urns module
 
 
- Submodules
- apache_beam.yaml package- Subpackages
- Submodules- apache_beam.yaml.cache_provider_artifacts module
- apache_beam.yaml.conftest module
- apache_beam.yaml.generate_yaml_docs module
- apache_beam.yaml.json_utils module
- apache_beam.yaml.main module
- apache_beam.yaml.options module
- apache_beam.yaml.yaml_combine module
- apache_beam.yaml.yaml_enrichment module
- apache_beam.yaml.yaml_errors module
- apache_beam.yaml.yaml_io module
- apache_beam.yaml.yaml_join module
- apache_beam.yaml.yaml_mapping module
- apache_beam.yaml.yaml_ml module
- apache_beam.yaml.yaml_provider module- NotAvailableWithReason
- Provider
- as_provider()
- as_provider_list()
- ExternalProvider
- java_jar()
- maven_jar()
- beam_jar()
- docker()
- RemoteProvider
- ExternalJavaProvider
- python()
- ExternalPythonProvider
- YamlProvider
- fix_pycallable()
- InlineProvider
- MetaInlineProvider
- get_default_sql_provider()
- SqlBackedProvider
- element_to_rows()
- dicts_to_rows()
- YamlProviders
- TranslatingProvider
- create_java_builtin_provider()
- PypiExpansionService
- RenamingProvider
- load_providers()
- parse_providers()
- merge_providers()
- standard_providers()
 
- apache_beam.yaml.yaml_specifiable module
- apache_beam.yaml.yaml_testing module
- apache_beam.yaml.yaml_transform module
- apache_beam.yaml.yaml_utils module
 
 
Submodules
- apache_beam.error module
- apache_beam.pipeline module- Pipeline- Pipeline.runner_implemented_transforms()
- Pipeline.display_data()
- Pipeline.options
- Pipeline.allow_unsafe_triggers
- Pipeline.transform_annotations()
- Pipeline.replace_all()
- Pipeline.run()
- Pipeline.visit()
- Pipeline.apply()
- Pipeline.to_runner_api()
- Pipeline.merge_compatible_environments()
- Pipeline.from_runner_api()
 
- transform_annotations()
 
- apache_beam.pvalue module