There are several ways to compress two files in Databricks, depending on the type of data and the desired outcome. Here are a few examples.

To compress each file individually with Python's gzip module:
import gzip
import shutil

# Stream each CSV through gzip and write it back alongside the
# original with a .gz extension
for path in ['/dbfs/path/to/file1.csv', '/dbfs/path/to/file2.csv']:
    with open(path, 'rb') as f_in, gzip.open(path + '.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
This streams each file through gzip and writes the compressed copy to disk with a .gz extension. The /dbfs prefix goes through the DBFS FUSE mount, which is what lets standard Python file APIs operate on DBFS paths.
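To sanity-check the result, you can stream a compressed file back through gzip; a minimal sketch, reusing the file1.csv.gz path from above:

import gzip

# Read the compressed file back in text mode and print the first line
with gzip.open('/dbfs/path/to/file1.csv.gz', 'rt') as f:
    print(f.readline())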
To bundle both files into a single zip archive instead:

import zipfile

# ZIP_DEFLATED enables real compression (the default, ZIP_STORED, only bundles)
with zipfile.ZipFile('/dbfs/path/to/archive.zip', 'w',
                     compression=zipfile.ZIP_DEFLATED) as myzip:
    myzip.write('/dbfs/path/to/file1.csv', arcname='file1.csv')
    myzip.write('/dbfs/path/to/file2.csv', arcname='file2.csv')
This creates a new zip archive at the specified path containing deflate-compressed copies of both input files; passing arcname stores just the file names rather than their full /dbfs paths.
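If you want to verify what went into the archive, zipfile can list its entries; a minimal sketch using the same archive path:

import zipfile

# Show each entry's name along with its uncompressed and compressed sizes
with zipfile.ZipFile('/dbfs/path/to/archive.zip') as myzip:
    for info in myzip.infolist():
        print(info.filename, info.file_size, info.compress_size)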
For Parquet data, you can instead merge the files with Spark and write the result back out compressed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("myApp").getOrCreate()

# union matches columns by position, so both inputs must share a schema
df1 = spark.read.parquet('/path/to/file1.parquet')
df2 = spark.read.parquet('/path/to/file2.parquet')
merged = df1.union(df2)

# Write the combined rows as gzip-compressed Parquet
merged.write.parquet('/path/to/merged.parquet', compression='gzip')
This combines the data from both input files and writes it as gzip-compressed Parquet. Note that Spark writes a directory of part files at that path rather than a single file.
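If a downstream system needs exactly one file, one option is to collapse to a single partition before writing; a minimal sketch reusing the merged DataFrame from above (merged_single.parquet is a hypothetical output path, and funneling everything through one task can be slow for large data):

# coalesce(1) yields a single part file inside the output directory,
# at the cost of writing through a single task
# (merged_single.parquet is a hypothetical path for illustration)
merged.coalesce(1).write.mode('overwrite').parquet(
    '/path/to/merged_single.parquet', compression='gzip')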