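Amazon EMR has included Delta Lake since release 6.9.0. In my testing, enabling it works much like enabling Apache Hudi. For details, see:
This post briefly records the steps. Delta Lake on EMR can also use the AWS Glue Data Catalog as its metastore.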
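As a rough illustration of the Glue Data Catalog point, the sketch below adds one extra setting to the spark-sql launch used later in this post. The factory class name is the one AWS documents for Spark with Glue; passing it per session through `--conf` (rather than at cluster creation through the `spark-hive-site` classification) is an assumption about this setup, and `GLUE_FACTORY` and `GLUE_CMD` are illustrative variable names that do not appear in the original steps.

```shell
# Sketch only: per-session Glue Data Catalog setting (an assumption, see above).
GLUE_FACTORY="com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"

# Collect the flags as positional parameters so each value stays one argument.
set -- \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.hadoop.hive.metastore.client.factory.class=${GLUE_FACTORY}"

# Keep a printable copy of the full command for review.
GLUE_CMD="spark-sql $*"

# Launch the shell only where the CLI exists (e.g. the EMR primary node);
# elsewhere just print the command.
if command -v spark-sql >/dev/null 2>&1; then
  spark-sql "$@"
else
  echo "$GLUE_CMD"
fi
```

If the cluster was already created with the Glue Data Catalog enabled, the extra `--conf` line is likely redundant and the plain launch command should behave the same way.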
aws s3 rm --recursive s3://glc-deltalake-test || aws s3 mb s3://glc-deltalake-test
spark-sql \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
-- create a delta lake table with the S3 location
CREATE TABLE delta_table(
  id string,
  creation_date string,
  last_update_time string)
USING delta
LOCATION 's3://glc-deltalake-test/delta_table';
-- insert data into the table
INSERT INTO delta_table VALUES
  ('100', '2015-01-01', '2015-01-01T13:51:39.340396Z'),
  ('101', '2015-01-01', '2015-01-01T12:14:58.597216Z'),
  ('102', '2015-01-01', '2015-01-01T13:51:40.417052Z'),
  ('103', '2015-01-01', '2015-01-01T13:51:40.519832Z');
-- check results
SELECT * FROM delta_table;
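
The statements above can also be run without an interactive session. The following is a minimal sketch, assuming the same bucket and table names as in this post: it writes the SQL to a file and hands it to spark-sql with the `-f` option. `BUCKET` and `SQL_FILE` are illustrative variable names, and `IF NOT EXISTS` is added here only so that a rerun does not fail on the existing table.

```shell
# Sketch: run the Delta Lake smoke test non-interactively.
# BUCKET and SQL_FILE are illustrative names, not part of the original steps.
BUCKET="${BUCKET:-glc-deltalake-test}"
SQL_FILE="${SQL_FILE:-${TMPDIR:-/tmp}/delta_smoke_test.sql}"

# Write the statements to a file so spark-sql can execute them with -f.
cat > "$SQL_FILE" <<EOF
-- create a delta lake table with the S3 location
CREATE TABLE IF NOT EXISTS delta_table(
  id string,
  creation_date string,
  last_update_time string)
USING delta
LOCATION 's3://${BUCKET}/delta_table';

-- insert data into the table
INSERT INTO delta_table VALUES
  ('100', '2015-01-01', '2015-01-01T13:51:39.340396Z'),
  ('101', '2015-01-01', '2015-01-01T12:14:58.597216Z'),
  ('102', '2015-01-01', '2015-01-01T13:51:40.417052Z'),
  ('103', '2015-01-01', '2015-01-01T13:51:40.519832Z');

-- check results
SELECT * FROM delta_table;
EOF

# Only launch Spark when the CLI is present (e.g. on the EMR primary node).
if command -v spark-sql >/dev/null 2>&1; then
  spark-sql \
    --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
    --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
    -f "$SQL_FILE"
else
  echo "spark-sql not found; wrote $SQL_FILE only"
fi
```

Keeping the SQL in a file makes the test easy to repeat after the bucket has been emptied with the `aws s3 rm` command shown earlier.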