SchemaMergeUtils
mergeSchemasInParallel
mergeSchemasInParallel(
  sparkSession: SparkSession,
  parameters: Map[String, String],
  files: Seq[FileStatus],
  schemaReader: (Seq[FileStatus], Configuration, Boolean) => Seq[StructType]): Option[StructType]
mergeSchemasInParallel determines a merged schema with a distributed Spark job.
mergeSchemasInParallel creates an RDD of file paths and their lengths, with the number of partitions up to the default parallelism (typically the number of CPU cores in a cluster).
In the end, mergeSchemasInParallel collects the RDD result (the schemas of the files, already merged per partition) and merges them all together to give the final merged schema.
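The two-phase merge described above (merge per partition, then merge the partial results) can be sketched in plain Scala without Spark. This is only an illustrative model under stated assumptions, not Spark's implementation: SchemaMergeSketch, FileInfo, the Schema alias (field name to type name) and the merge function are hypothetical stand-ins for the RDD, FileStatus, StructType and Spark's schema merging, and the simplified one-argument schemaReader drops the Configuration and Boolean parameters of the real signature.

```scala
object SchemaMergeSketch {
  // Hypothetical stand-in for StructType: field name -> type name.
  type Schema = Map[String, String]

  // Hypothetical stand-in for FileStatus, reduced to path and length.
  final case class FileInfo(path: String, length: Long)

  // Merges two schemas; fails when the same field has conflicting types.
  def merge(left: Schema, right: Schema): Schema =
    right.foldLeft(left) { case (acc, (name, tpe)) =>
      acc.get(name) match {
        case Some(existing) if existing != tpe =>
          throw new IllegalArgumentException(
            s"Conflicting types for field $name: $existing vs $tpe")
        case _ => acc + (name -> tpe)
      }
    }

  def mergeSchemasInParallel(
      files: Seq[FileInfo],
      defaultParallelism: Int, // assumed to play the role of the cluster's default parallelism
      schemaReader: Seq[FileInfo] => Seq[Schema]): Option[Schema] = {
    // At least one partition, at most defaultParallelism partitions.
    val numPartitions = math.min(math.max(files.size, 1), defaultParallelism)
    val sliceSize = math.max(1, math.ceil(files.size.toDouble / numPartitions).toInt)

    // Models the partitions of the RDD of (path, length) pairs.
    val partitions = files.grouped(sliceSize).toSeq

    // "Executor" side: read the schemas of one partition and merge them
    // into at most one partial schema per partition.
    val partialSchemas = partitions.flatMap { part =>
      schemaReader(part).reduceOption(merge)
    }

    // "Driver" side: merge the collected partial schemas into the final one.
    partialSchemas.reduceOption(merge)
  }
}
```

Calling SchemaMergeSketch.mergeSchemasInParallel with an empty file list gives None, which mirrors the Option[StructType] return type: there is nothing to merge when no schema could be read.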
mergeSchemasInParallel is used when:
OrcFileFormat is requested to inferSchema
OrcUtils is requested to infer schema
ParquetFileFormat is requested to infer schema