When using Polars and Glob to read Parquet from S3, why does it only read the first file?

asked 2021-12-04 11:00:00 +0000

djk

1 Answer

answered 2022-01-31 07:00:00 +0000

lakamha

There could be a few reasons why only the first file is being read when using Polars and Glob to read Parquet from S3:

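  1. Incorrect file path or pattern: Python's glob.glob() searches only the local filesystem and does not understand s3:// URLs, so it cannot list the objects in a bucket. Use an S3-aware listing instead, such as the glob() method of s3fs.S3FileSystem, and make sure the pattern includes the bucket name, the prefix, and the file extension. Print the list of matches to confirm it contains every file you expect.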

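  2. Only one path reaches the reader: Parallelism settings do not control how many files are read, and Polars' read_parquet has no n_workers parameter. If a single path (for example, the first element of the glob result) is passed to the reader, only that file is loaded. Read each matched file and combine the results with pl.concat(), or, if your Polars version expands glob patterns for S3 paths, pass the pattern to the reader directly.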

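  3. Memory limitations: Running out of memory usually ends in an error or a killed process rather than a result that silently contains one file, so this is a less likely cause. If memory is a concern, read the files one at a time, select only the columns you need, or use the lazy pl.scan_parquet(). Note that row_group_size is a setting for writing Parquet files, not for reading them in batches.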

  4. Parquet file compatibility: Make sure every file can be read by Polars and that all of the files share the same schema. If column names or types differ between files, combining them with pl.concat() can raise an error until the schemas are aligned.
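
To illustrate the points about listing and combining files, here is a minimal sketch rather than a definitive fix. It assumes the polars package is installed and that fs is an fsspec-style filesystem object such as s3fs.S3FileSystem() with credentials already configured. The function name read_parquet_glob and the bucket path are hypothetical placeholders, not anything taken from the question.

```python
import glob


def read_parquet_glob(fs, pattern):
    """Read every Parquet file matching `pattern` on `fs` into one DataFrame.

    `fs` is assumed to be an fsspec-style filesystem exposing glob() and
    open(), for example s3fs.S3FileSystem(). The function name is hypothetical.
    """
    # List the matches with the filesystem's own glob, not the stdlib glob.
    keys = sorted(fs.glob(pattern))
    if not keys:
        raise FileNotFoundError(f"no files match {pattern!r}")

    # Imported here so the listing step above can be checked without Polars.
    import polars as pl

    frames = []
    for key in keys:
        # Open each object as a binary file and read it separately.
        with fs.open(key, "rb") as f:
            frames.append(pl.read_parquet(f))

    # Reading only keys[0] would return just the first file; concatenating
    # all frames gives the combined result (schemas must match).
    return pl.concat(frames)


# The standard-library glob only searches the local filesystem, so an
# S3 URL pattern finds no matches there.
local_matches = glob.glob("s3://my-bucket/data/*.parquet")
```

With s3fs installed, a call could look like read_parquet_glob(s3fs.S3FileSystem(), "my-bucket/data/*.parquet"), where the bucket and prefix stand in for your own. Sorting the keys keeps the row order reproducible between runs.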


Asked: 2021-12-04 11:00:00 +0000

Seen: 24 times

Last updated: Jan 31 '22