
SchemaMergeUtils

mergeSchemasInParallel

mergeSchemasInParallel(
  sparkSession: SparkSession,
  parameters: Map[String, String],
  files: Seq[FileStatus],
  schemaReader: (Seq[FileStatus], Configuration, Boolean) => Seq[StructType]): Option[StructType]

mergeSchemasInParallel determines a merged schema with a distributed Spark job.


mergeSchemasInParallel creates an RDD of file paths and their lengths, with the number of partitions capped at the default parallelism (the number of CPU cores in a cluster).

In the end, mergeSchemasInParallel collects the RDD result, which holds one merged schema per partition, and merges these partial schemas together to give the final merged schema.
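
The two-level merge described above (one merged schema per partition, then a final merge of the collected results) can be illustrated with a minimal local sketch. This is not Spark's implementation: plain Scala collections stand in for the RDD, a vector of (name, type) pairs stands in for StructType, and the names MergeSchemasSketch, FileInfo, merge, mergeSchemas and readSchema are hypothetical, introduced only for illustration. The merge rule (keep the first occurrence of each field name) is likewise an assumption.

```scala
// Local simulation only: plain Scala collections stand in for the RDD,
// and a vector of (name, type) pairs stands in for StructType.
object MergeSchemasSketch {
  type Schema = Vector[(String, String)] // hypothetical stand-in for StructType
  final case class FileInfo(path: String, length: Long)

  // Hypothetical merge rule: keep fields in first-seen order, append unseen names.
  def merge(left: Schema, right: Schema): Schema = {
    val known = left.map(_._1).toSet
    left ++ right.filterNot { case (name, _) => known.contains(name) }
  }

  // readSchema is a hypothetical reader; None models a file without a usable schema.
  def mergeSchemas(
      files: Seq[FileInfo],
      parallelism: Int,
      readSchema: FileInfo => Option[Schema]): Option[Schema] = {
    // At least one slice, and never more slices than the given parallelism.
    val numSlices = math.min(math.max(files.size, 1), parallelism)
    val sliceSize = math.max(1, math.ceil(files.size.toDouble / numSlices).toInt)

    // Phase 1 (per partition in the distributed job): one partial schema per slice.
    val partial: Vector[Schema] = files
      .grouped(sliceSize)
      .flatMap(slice => slice.flatMap(f => readSchema(f)).reduceOption(merge))
      .toVector

    // Phase 2 (after collecting the results): merge the partial schemas.
    partial.reduceOption(merge)
  }
}
```

With three files split into at most two slices, the first slice yields one partial schema for two files, the second slice yields another, and the final step merges the two. An empty list of files gives None, matching the Option[StructType] return type of mergeSchemasInParallel.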


mergeSchemasInParallel is used when:

  • OrcFileFormat is requested to infer schema
  • OrcUtils is requested to infer schema
  • ParquetFileFormat is requested to infer schema