Apache Beam SDK for Python
Apache Beam provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines.
The Apache Beam SDK for Python provides access to Apache Beam capabilities from the Python programming language.
Overview
The key concepts in this programming model are:
- PCollection: represents a collection of data, which could be bounded or unbounded in size.
- PTransform: represents a computation that transforms input PCollections into output PCollections.
- Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
- PipelineRunner: specifies where and how the pipeline should execute.
- Read: read from an external source.
- Write: write to an external data sink.
Typical usage
At the top of your source file:
import apache_beam as beam
After this import statement:
- Transform classes are available as beam.FlatMap, beam.GroupByKey, etc.
- The Pipeline class is available as beam.Pipeline.
- Text read/write transforms are available as beam.io.ReadFromText and beam.io.WriteToText.
Examples
The examples subdirectory has some examples.
Subpackages
- apache_beam.coders package
- Submodules
- apache_beam.coders.avro_record module
- apache_beam.coders.coders module
  - Coder
  - AvroGenericCoder
  - BooleanCoder
  - BytesCoder
  - DillCoder
  - FastPrimitivesCoder
  - FloatCoder
  - IterableCoder
  - ListCoder
  - MapCoder
  - NullableCoder
  - PickleCoder
  - ProtoCoder
  - ProtoPlusCoder
  - ShardedKeyCoder
  - SinglePrecisionFloatCoder
  - SingletonCoder
  - StrUtf8Coder
  - TimestampCoder
  - TupleCoder
  - TupleSequenceCoder
  - VarIntCoder
  - WindowedValueCoder
  - ParamWindowedValueCoder
  - BigIntegerCoder
  - DecimalCoder
- apache_beam.coders.observable module
- apache_beam.coders.row_coder module
- apache_beam.coders.slow_stream module
- apache_beam.coders.typecoders module
- apache_beam.dataframe package
- Submodules
- apache_beam.dataframe.convert module
- apache_beam.dataframe.doctests module
- apache_beam.dataframe.expressions module
- apache_beam.dataframe.frame_base module
- apache_beam.dataframe.frames module
- apache_beam.dataframe.io module
- apache_beam.dataframe.pandas_top_level_functions module
- apache_beam.dataframe.partitionings module
- apache_beam.dataframe.schemas module
- apache_beam.dataframe.transforms module
- apache_beam.io package
- Subpackages
- Submodules
- apache_beam.io.avroio module
- apache_beam.io.concat_source module
- apache_beam.io.debezium module
- apache_beam.io.filebasedsink module
- apache_beam.io.filebasedsource module
- apache_beam.io.fileio module
- apache_beam.io.filesystem module
- apache_beam.io.filesystemio module
- apache_beam.io.filesystems module
- apache_beam.io.hadoopfilesystem module
- apache_beam.io.iobase module
- apache_beam.io.jdbc module
- apache_beam.io.kafka module
- apache_beam.io.kinesis module
- apache_beam.io.localfilesystem module
- apache_beam.io.mongodbio module
- apache_beam.io.parquetio module
- apache_beam.io.range_trackers module
- apache_beam.io.requestresponse module
- apache_beam.io.restriction_trackers module
- apache_beam.io.snowflake module
- apache_beam.io.source_test_utils module
- apache_beam.io.textio module
- apache_beam.io.tfrecordio module
- apache_beam.io.utils module
- apache_beam.io.watermark_estimators module
- apache_beam.metrics package
- Submodules
- apache_beam.metrics.cells module
- apache_beam.metrics.metric module
- apache_beam.metrics.metricbase module
- apache_beam.metrics.monitoring_infos module
  - extract_counter_value()
  - extract_gauge_value()
  - extract_distribution()
  - extract_string_set_value()
  - create_labels()
  - int64_user_counter()
  - int64_counter()
  - int64_user_distribution()
  - int64_distribution()
  - int64_user_gauge()
  - int64_gauge()
  - user_set_string()
  - create_monitoring_info()
  - is_counter()
  - is_gauge()
  - is_distribution()
  - is_string_set()
  - is_user_monitoring_info()
  - extract_metric_result_map_value()
  - parse_namespace_and_name()
  - get_step_name()
  - to_key()
  - sum_payload_combiner()
  - distribution_payload_combiner()
  - consolidate()
- apache_beam.ml package
- apache_beam.options package
- apache_beam.portability package
- apache_beam.runners package
- apache_beam.testing package
- Subpackages
- Submodules
- apache_beam.testing.datatype_inference module
- apache_beam.testing.extra_assertions module
- apache_beam.testing.metric_result_matchers module
- apache_beam.testing.pipeline_verifiers module
- apache_beam.testing.synthetic_pipeline module
  - get_generator()
  - parse_byte_size()
  - div_round_up()
  - rotate_key()
  - initial_splitting_zipf()
  - SyntheticStep
  - NonLiquidShardingOffsetRangeTracker
  - SyntheticSDFStepRestrictionProvider
  - get_synthetic_sdf_step()
  - SyntheticSource
  - SyntheticSDFSourceRestrictionProvider
  - SyntheticSDFAsSource
  - ShuffleBarrier
  - SideInputBarrier
  - merge_using_gbk()
  - merge_using_side_input()
  - expand_using_gbk()
  - expand_using_second_output()
  - parse_args()
  - run()
  - StatefulLoadGenerator
- apache_beam.testing.test_pipeline module
- apache_beam.testing.test_stream module
- apache_beam.testing.test_stream_service module
- apache_beam.testing.test_utils module
- apache_beam.testing.util module
- apache_beam.transforms package
- Subpackages
- Submodules
- apache_beam.transforms.combinefn_lifecycle_pipeline module
- apache_beam.transforms.combiners module
- apache_beam.transforms.core module
- apache_beam.transforms.create_source module
- apache_beam.transforms.deduplicate module
- apache_beam.transforms.display module
- apache_beam.transforms.enrichment module
- apache_beam.transforms.environments module
- apache_beam.transforms.error_handling module
- apache_beam.transforms.external module
  - convert_to_typing_type()
  - iter_urns()
  - PayloadBuilder
  - SchemaBasedPayloadBuilder
  - ImplicitSchemaPayloadBuilder
  - NamedTupleBasedPayloadBuilder
  - SchemaTransformPayloadBuilder
  - ExplicitSchemaTransformPayloadBuilder
  - JavaClassLookupPayloadBuilder
  - SchemaTransformsConfig
  - SchemaAwareExternalTransform
  - JavaExternalTransform
  - AnnotationBasedPayloadBuilder
  - DataclassBasedPayloadBuilder
  - ExternalTransform
  - ExpansionAndArtifactRetrievalStub
  - JavaJarExpansionService
  - BeamJarExpansionService
  - memoize()
- apache_beam.transforms.external_java module
- apache_beam.transforms.external_transform_provider module
- apache_beam.transforms.fully_qualified_named_transform module
- apache_beam.transforms.periodicsequence module
- apache_beam.transforms.ptransform module
- apache_beam.transforms.resources module
- apache_beam.transforms.sideinputs module
- apache_beam.transforms.sql module
- apache_beam.transforms.stats module
- apache_beam.transforms.timeutil module
- apache_beam.transforms.trigger module
- apache_beam.transforms.userstate module
  - StateSpec
  - ReadModifyWriteStateSpec
  - BagStateSpec
  - SetStateSpec
  - CombiningValueStateSpec
  - Timer
  - TimerSpec
  - on_timer()
  - get_dofn_specs()
  - is_stateful_dofn()
  - validate_stateful_dofn()
  - BaseTimer
  - RuntimeTimer
  - RuntimeState
  - ReadModifyWriteRuntimeState
  - AccumulatingRuntimeState
  - BagRuntimeState
  - SetRuntimeState
  - CombiningValueRuntimeState
  - UserStateContext
- apache_beam.transforms.util module
- apache_beam.transforms.window module
- apache_beam.typehints package
- Subpackages
- Submodules
- apache_beam.typehints.arrow_batching_microbenchmark module
- apache_beam.typehints.arrow_type_compatibility module
- apache_beam.typehints.batch module
- apache_beam.typehints.decorators module
- apache_beam.typehints.intrinsic_one_ops module
- apache_beam.typehints.native_type_compatibility module
- apache_beam.typehints.opcodes module
  - pop_one()
  - pop_two()
  - pop_three()
  - push_value()
  - nop()
  - resume()
  - pop_top()
  - end_for()
  - end_send()
  - copy()
  - rot_n()
  - rot_two()
  - rot_three()
  - rot_four()
  - dup_top()
  - unary()
  - unary_positive()
  - unary_negative()
  - unary_invert()
  - unary_not()
  - unary_convert()
  - get_iter()
  - symmetric_binary_op()
  - binary_power()
  - inplace_power()
  - binary_multiply()
  - inplace_multiply()
  - binary_divide()
  - inplace_divide()
  - binary_floor_divide()
  - inplace_floor_divide()
  - binary_true_divide()
  - inplace_true_divide()
  - binary_modulo()
  - inplace_modulo()
  - binary_add()
  - inplace_add()
  - binary_subtract()
  - inplace_subtract()
  - binary_subscr()
  - binary_lshift()
  - inplace_lshift()
  - binary_rshift()
  - inplace_rshift()
  - binary_and()
  - inplace_and()
  - binary_xor()
  - inplace_xor()
  - binary_or()
  - inplace_or()
  - binary_op()
  - store_subscr()
  - binary_slice()
  - store_slice()
  - print_item()
  - print_newline()
  - list_append()
  - set_add()
  - map_add()
  - load_locals()
  - exec_stmt()
  - build_class()
  - unpack_sequence()
  - dup_topx()
  - store_attr()
  - delete_attr()
  - store_global()
  - delete_global()
  - load_const()
  - load_name()
  - build_tuple()
  - build_list()
  - build_set()
  - build_map()
  - build_const_key_map()
  - list_to_tuple()
  - list_extend()
  - set_update()
  - dict_update()
  - dict_merge()
  - load_attr()
  - load_method()
  - compare_op()
  - is_op()
  - contains_op()
  - import_name()
  - import_from()
  - load_global()
  - store_map()
  - load_fast()
  - load_fast_check()
  - load_fast_and_clear()
  - store_fast()
  - delete_fast()
  - swap()
  - reraise()
  - gen_start()
  - load_closure()
  - load_deref()
  - make_function()
  - make_closure()
  - build_slice()
  - build_list_unpack()
  - build_set_unpack()
  - build_tuple_unpack()
  - build_tuple_unpack_with_call()
  - build_map_unpack()
- apache_beam.typehints.pandas_type_compatibility module
- apache_beam.typehints.pytorch_type_compatibility module
- apache_beam.typehints.row_type module
- apache_beam.typehints.schema_registry module
- apache_beam.typehints.schemas module
  - named_fields_to_schema()
  - named_fields_from_schema()
  - typing_to_runner_api()
  - typing_from_runner_api()
  - value_to_runner_api()
  - value_from_runner_api()
  - option_to_runner_api()
  - option_from_runner_api()
  - schema_field()
  - SchemaTranslation
  - named_tuple_from_schema()
  - named_tuple_to_schema()
  - schema_from_element_type()
  - named_fields_from_element_type()
  - union_schema_type()
  - LogicalTypeRegistry
  - LogicalType
  - NoArgumentLogicalType
  - PassThroughLogicalType
  - MicrosInstantRepresentation
  - MillisInstant
  - MicrosInstant
  - PythonCallable
  - FixedPrecisionDecimalArgumentRepresentation
  - DecimalLogicalType
  - FixedPrecisionDecimalLogicalType
  - FixedBytes
  - VariableBytes
  - FixedString
  - VariableString
- apache_beam.typehints.sharded_key_type module
- apache_beam.typehints.trivial_inference module
- apache_beam.typehints.typecheck module
- apache_beam.typehints.typehints module
- apache_beam.utils package
- Submodules
- apache_beam.utils.annotations module
- apache_beam.utils.histogram module
- apache_beam.utils.interactive_utils module
- apache_beam.utils.multi_process_shared module
- apache_beam.utils.plugin module
- apache_beam.utils.processes module
- apache_beam.utils.profiler module
- apache_beam.utils.proto_utils module
- apache_beam.utils.python_callable module
- apache_beam.utils.retry module
  - PermanentException
  - FuzzedExponentialIntervals
  - retry_on_server_errors_filter()
  - retry_on_server_errors_and_notfound_filter()
  - retry_on_server_errors_and_timeout_filter()
  - retry_on_server_errors_timeout_or_quota_issues_filter()
  - retry_on_beam_io_error_filter()
  - retry_if_valid_input_but_server_error_and_timeout_filter()
  - Clock
  - no_retries()
  - with_exponential_backoff()
- apache_beam.utils.sentinel module
- apache_beam.utils.sharded_key module
- apache_beam.utils.shared module
- apache_beam.utils.subprocess_server module
- apache_beam.utils.thread_pool_executor module
- apache_beam.utils.timestamp module
- apache_beam.utils.transform_service_launcher module
- apache_beam.utils.urns module
- apache_beam.yaml package
- Submodules
- apache_beam.yaml.cache_provider_artifacts module
- apache_beam.yaml.generate_yaml_docs module
- apache_beam.yaml.json_utils module
- apache_beam.yaml.main module
- apache_beam.yaml.options module
- apache_beam.yaml.yaml_combine module
- apache_beam.yaml.yaml_io module
- apache_beam.yaml.yaml_join module
- apache_beam.yaml.yaml_mapping module
- apache_beam.yaml.yaml_ml module
- apache_beam.yaml.yaml_provider module
  - Provider
  - as_provider()
  - as_provider_list()
  - ExternalProvider
  - java_jar()
  - maven_jar()
  - beam_jar()
  - docker()
  - RemoteProvider
  - ExternalJavaProvider
  - python()
  - ExternalPythonProvider
  - YamlProvider
  - fix_pycallable()
  - InlineProvider
  - MetaInlineProvider
  - SqlBackedProvider
  - element_to_rows()
  - dicts_to_rows()
  - YamlProviders
  - TranslatingProvider
  - create_java_builtin_provider()
  - PypiExpansionService
  - RenamingProvider
  - flatten_included_provider_specs()
  - parse_providers()
  - merge_providers()
  - standard_providers()
- apache_beam.yaml.yaml_transform module