SchemaMergeUtils
mergeSchemasInParallel
mergeSchemasInParallel(
  sparkSession: SparkSession,
  parameters: Map[String, String],
  files: Seq[FileStatus],
  schemaReader: (Seq[FileStatus], Configuration, Boolean) => Seq[StructType]): Option[StructType]
mergeSchemasInParallel determines a merged schema with a distributed Spark job.
mergeSchemasInParallel creates an RDD of file paths and their lengths, with the number of partitions capped at the default parallelism (the number of CPU cores in a cluster).
In the end, mergeSchemasInParallel collects the RDD result, which holds one merged schema per partition (covering the files in that partition), and merges these partial schemas together to give the final merged schema.
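The two-level merge described above (a partial schema per partition, then a final merge of the collected partials) can be modeled without Spark. The following is only a minimal sketch in plain Scala, not Spark's implementation: groups of a sequence stand in for RDD partitions, a schema is simplified to a map from field name to type name, and TwoLevelSchemaMerge, Schema, mergeTwo, mergeInGroups and readSchema are all hypothetical names introduced for illustration.

```scala
// A plain-Scala model of the two-level schema merge (no Spark required).
// Every name in this object is hypothetical and exists only for illustration.
object TwoLevelSchemaMerge {
  // Simplified schema: field name -> type name (stands in for StructType).
  type Schema = Map[String, String]

  // Merge two schemas; fail when the same field has two different types.
  def mergeTwo(left: Schema, right: Schema): Schema =
    right.foldLeft(left) { case (acc, (name, tpe)) =>
      acc.get(name) match {
        case Some(existing) if existing != tpe =>
          throw new IllegalArgumentException(
            s"Conflicting types for field '$name': $existing vs $tpe")
        case _ => acc + (name -> tpe)
      }
    }

  // files: (path, length) pairs; parallelism: assumed upper bound on groups.
  def mergeInGroups(
      files: Seq[(String, Long)],
      parallelism: Int,
      readSchema: String => Schema): Option[Schema] = {
    require(parallelism > 0, "parallelism must be positive")
    // Use at least one group, and no more groups than the parallelism.
    val numGroups = math.min(math.max(files.size, 1), parallelism)
    // Ceiling division so that all files fit into at most numGroups groups.
    val groupSize = math.max(1, (files.size + numGroups - 1) / numGroups)

    // Step 1: each group (a stand-in for a partition) yields one partial schema.
    val partials: Seq[Schema] = files.grouped(groupSize).toSeq.flatMap { group =>
      group.map { case (path, _) => readSchema(path) }.reduceOption(mergeTwo)
    }

    // Step 2: merge the collected partial schemas; None when there are no files.
    partials.reduceOption(mergeTwo)
  }
}
```

For example, given three files whose schemas are {id: long}, {id: long, name: string} and {ts: timestamp}, and a parallelism of 2, the first group produces {id, name}, the second produces {ts}, and the final merge contains all three fields. An empty file list gives None, which mirrors the Option[StructType] return type in the signature.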
mergeSchemasInParallel is used when:
OrcFileFormat is requested to inferSchema
OrcUtils is requested to infer schema
ParquetFileFormat is requested to infer schema