Ctrl+K
Logo image Logo image

Site Navigation

  • API Reference
  • Examples

Site Navigation

  • API Reference
  • Examples

Section Navigation

  • PyFlink Table
  • PyFlink DataStream
    • StreamExecutionEnvironment
    • DataStream
    • Functions
    • State
    • Timer
    • Window
    • Checkpoint
    • Side Outputs
    • Connectors
    • Formats
  • PyFlink Common

pyflink.datastream.connectors.file_system.FileSource#

class FileSource(j_file_source)[source]#

A unified data source that reads files - both in batch and in streaming mode.

This source supports all (distributed) file systems and object stores that can be accessed via the Flink’s FileSystem class.

Start building a file source via one of the following calls:

  • for_record_stream_format()

This creates a FileSourceBuilder on which you can configure all the properties of the file source.

<h2>Batch and Streaming</h2>

This source supports both bounded/batch and continuous/streaming data inputs. For the bounded/batch case, the file source processes all files under the given path(s). In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.

When you start creating a file source (via the FileSourceBuilder created through one of the above-mentioned methods) the source is by default in bounded/batch mode. Call monitor_continuously() to put the source into continuous streaming mode.

<h2>Format Types</h2>

The reading of each file happens through file readers defined by <i>file formats</i>. These define the parsing logic for the contents of the file. There are multiple classes that the source supports. Their interfaces trade of simplicity of implementation and flexibility/efficiency.

  • A StreamFormat reads the contents of a file from a file stream. It is the simplest format to implement, and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply (such as object reuse, batching, etc.).

<h2>Discovering / Enumerating Files</h2>

The way that the source lists the files to be processes is defined by the FileEnumeratorProvider. The FileEnumeratorProvider is responsible to select the relevant files (for example filter out hidden files) and to optionally splits files into multiple regions (= file source splits) that can be read in parallel).

Methods

for_bulk_file_format(bulk_format, *paths)

for_record_stream_format(stream_format, *paths)

Builds a new FileSource using a StreamFormat to read record-by-record from a file stream.

get_java_function()

previous

pyflink.datastream.connectors.file_system.FileSourceBuilder

next

pyflink.datastream.connectors.file_system.BucketAssigner

On this page
  • FileSource
Show Source

Created using Sphinx 5.3.0.