Kafka Options¶
Options with the kafka. prefix (e.g. kafka.bootstrap.servers) are considered configuration properties for the Kafka consumers used on the driver and executors.
Kafka options are defined as part of KafkaSourceProvider.
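For example, a streaming query could pass Kafka consumer properties through kafka.-prefixed options. A minimal sketch (the broker address and topic name are made up, and spark is assumed to be a SparkSession in scope):
val records = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // becomes bootstrap.servers of the Kafka consumers
  .option("subscribe", "topic1")
  .load()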
assign¶
Topic subscription strategy that accepts a JSON with topic names and partitions, e.g.
{"topicA":[0,1],"topicB":[0,1]}
Exactly one topic subscription strategy is allowed (that KafkaSourceProvider validates before creating KafkaSource).
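A sketch using Scala's triple quotes for the JSON (topic names and partition numbers are illustrative):
option("assign", """{"topicA":[0,1],"topicB":[0,1]}""")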
endingOffsets¶
endingTimestamp¶
endingOffsetsByTimestamp¶
failOnDataLoss¶
Default: true
Used when:
- KafkaSourceProvider is requested for failOnDataLoss
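As a sketch, a query that should keep running when data may have been lost (e.g. topics deleted or offsets out of range) could disable the option:
option("failOnDataLoss", "false") // do not fail the query when data may have been lost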
fetchOffset.numRetries¶
fetchOffset.retryIntervalMs¶
groupIdPrefix¶
includeHeaders¶
Default: false
kafka.bootstrap.servers¶
(required) bootstrap.servers configuration property of the Kafka consumers used on the driver and executors
Default: (empty)
kafkaConsumer.pollTimeoutMs¶
The time (in milliseconds) spent waiting in Consumer.poll if data is not available in the buffer.
Default: spark.network.timeout or 120s
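A sketch of raising the poll timeout (the value is arbitrary):
option("kafkaConsumer.pollTimeoutMs", "30000") // wait up to 30 seconds in Consumer.poll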
maxOffsetsPerTrigger¶
Maximum number of records to fetch per trigger (to limit how many records a micro-batch processes).
Default: (undefined)
Unless defined, KafkaSource requests KafkaOffsetReader for the latest offsets.
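A sketch of rate-limiting a micro-batch (the limit is arbitrary):
option("maxOffsetsPerTrigger", "10000") // fetch at most 10,000 records per trigger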
maxTriggerDelay¶
Default: 15m
For streaming queries only (ignored in batch queries)
Used when:
- KafkaMicroBatchStream is requested for maxTriggerDelayMs
- KafkaSource is requested for maxTriggerDelayMs
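A sketch combining maxTriggerDelay with minOffsetsPerTrigger (described below); the values are illustrative. The intent is to wait until at least a minimum number of records is available, but never longer than the given delay:
option("minOffsetsPerTrigger", "100000") // wait until at least 100,000 records are available...
option("maxTriggerDelay", "5m")          // ...but trigger anyway after 5 minutes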
minOffsetsPerTrigger¶
Minimum number of records to read in a micro-batch
Default: (undefined)
For streaming queries only (ignored in batch queries)
Validated in validateGeneralOptions
minOffsetsPerTrigger vs minOffsetPerTrigger
The option's name is minOffsetsPerTrigger (with an s), while the code refers to it as minOffsetPerTrigger (a singular offset).
Used when:
minPartitions¶
Minimum number of (Spark) partitions to read the Kafka TopicPartitions into
Default: (undefined)
Must be undefined (default) or greater than 0
Review
When undefined (default) or smaller than the number of TopicPartitions with records to consume from, KafkaMicroBatchReader uses KafkaOffsetRangeCalculator to find the preferred executor for every TopicPartition (and the available executors).
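A sketch (the value is arbitrary):
option("minPartitions", "30") // request at least 30 Spark partitions, regardless of the number of Kafka TopicPartitions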
startingOffsets¶
Starting offsets
Default: latest
Possible values:
- latest
- earliest
- JSON with topics, partitions and their starting offsets, e.g. {"topicA":{"part":offset,"p1":-1},"topicB":{"0":-2}}
Tip
Use Scala's triple quotes for the JSON with topics, partitions and offsets.
option("startingOffsets", """{"topic1":{"0":5,"4":-1},"topic2":{"0":-2}}""")
startingOffsetsByTimestamp¶
startingTimestamp¶
subscribe¶
Topic subscription strategy (SubscribeStrategy) that accepts topic names as a comma-separated string, e.g.
topic1,topic2,topic3
Exactly one topic subscription strategy is allowed (that KafkaSourceProvider validates before creating KafkaSource).
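A sketch (the topic names are illustrative):
option("subscribe", "topic1,topic2,topic3")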
subscribePattern¶
Topic subscription strategy that accepts a Java regular expression (java.util.regex.Pattern) matching the topics to subscribe to, e.g.
topic\d
Tip
Use Scala's triple quotes for the topic subscription regex pattern.
option("subscribepattern", """topic\d""")
Exactly one topic subscription strategy is allowed (that KafkaSourceProvider validates before creating KafkaSource).
topic¶
Optional name of the topic to write the result of a streaming query to
Default: (empty)
Unless defined, the Kafka data source uses the topic names as given in the topic field of the dataset.
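A minimal sketch of writing out a streaming query to Kafka with an explicit topic (the broker address, topic name and checkpoint location are made up; records is assumed to be a streaming Dataset with a value column):
records
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()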