mergeSchemasInParallel(
  sparkSession: SparkSession,
  parameters: Map[String, String],
  files: Seq[FileStatus],
  schemaReader: (Seq[FileStatus], Configuration, Boolean) => Seq[StructType]): Option[StructType]
mergeSchemasInParallel determines a merged schema with a distributed Spark job.
mergeSchemasInParallel creates an RDD of file paths and their lengths, with the number of partitions capped at the default parallelism (the number of CPU cores in a cluster).
In the end, mergeSchemasInParallel collects the result of the RDD (one merged schema per partition) and merges these partial schemas together to give the final merged schema.
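The two-phase merge described above (merge per partition on executors, then merge the partial results on the driver) can be sketched roughly as follows. This is a minimal illustration, not the actual Spark implementation: MergeSchemasSketch, mergeTwo, mergeInParallel and the readSchemas parameter are hypothetical names, files are modelled as plain (path, length) pairs, and mergeTwo is a simplified stand-in that only unions fields by name and does not reconcile conflicting types.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object MergeSchemasSketch {

  // Hypothetical, simplified merge: keeps all fields of `left` and appends the
  // fields of `right` whose names are not yet present. No type reconciliation.
  def mergeTwo(left: StructType, right: StructType): StructType = {
    val known = left.fieldNames.toSet
    StructType(left.fields ++ right.fields.filterNot(f => known.contains(f.name)))
  }

  // `files` holds (path, length) pairs; `readSchemas` is an assumed,
  // serializable function that reads the schemas of the given files.
  def mergeInParallel(
      spark: SparkSession,
      files: Seq[(String, Long)],
      readSchemas: Seq[(String, Long)] => Seq[StructType]): Option[StructType] = {
    // At least one partition, at most the default parallelism
    val numPartitions =
      math.min(math.max(files.size, 1), spark.sparkContext.defaultParallelism)

    val partialSchemas: Array[StructType] = spark.sparkContext
      .parallelize(files, numPartitions)
      .mapPartitions { part =>
        // Runs on executors: emit at most one merged schema per partition
        readSchemas(part.toSeq).reduceOption(mergeTwo).iterator
      }
      .collect()

    // Runs on the driver: merge the per-partition schemas into the final one
    partialSchemas.reduceOption(mergeTwo)
  }
}
```

The result is None when no schema could be read at all, which mirrors the Option[StructType] return type in the signature above.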
mergeSchemasInParallel is used when:
OrcFileFormat is requested to
OrcUtils is requested to infer schema
ParquetFileFormat is requested to infer schema