March 19, 2019, 2:20 p.m.
I was running an AWS Glue job that read Parquet files from an S3 bucket. When I loaded the files individually there were no problems, but when I loaded the entire directory and tried to do any sort of transformation on the data, I got this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 1.0 failed 4 times, most recent failure: Lost task 4.3 in stage 1.0 (TID 23, ip-172-31-4-9.eu-west-1.compute.internal, executor 1): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://...
I couldn't find any useful information about this online, so I thought I'd post the solution here in case anyone else has the same problem.
The problem was that some of the files had columns that were entirely null, and apparently Spark doesn't like that. When reading the files individually, I guess Spark took the schema from each file, but reading the whole directory at once apparently caused errors. I solved the problem by dropping any all-null columns before writing the Parquet files.
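As a rough illustration of the fix, here is a minimal sketch of the "find and drop all-null columns" step. It works on plain Python records (a list of dicts) rather than on a real DataFrame, and the function names and sample data are made up for this example. It is not the exact code from my job.

```python
def all_null_columns(rows):
    """Return the names of columns whose value is None (or missing) in every row."""
    columns = set()
    for row in rows:
        columns.update(row)
    return {
        col for col in columns
        if all(row.get(col) is None for row in rows)
    }


def drop_columns(rows, to_drop):
    """Return copies of the rows without the given columns."""
    return [
        {key: value for key, value in row.items() if key not in to_drop}
        for row in rows
    ]


# Hypothetical sample data: "notes" is null in every row, "name" only in one.
records = [
    {"id": 1, "name": "a", "notes": None},
    {"id": 2, "name": None, "notes": None},
]

empty = all_null_columns(records)
cleaned = drop_columns(records, empty)
```

Only "notes" is dropped here, because "name" has at least one real value. If the data is in a pandas DataFrame before it is written, `df.dropna(axis=1, how="all")` does the same thing in one call.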
March 7, 2019, 4:52 p.m.
I've been working with AWS Lambda recently and I am very impressed. Usually, if I need to set up a microservice, a recurring task, or anything like that, I just set something up on one of my virtual servers, so I didn't think Lambda would be all that useful. But it makes it really, really easy to set up little tasks, and it is much cheaper than having a whole virtual server.
You can create tasks in a number of different languages and set up a variety of triggers, ranging from HTTP requests to scheduled tasks. When the Lambda is triggered, AWS spins it up, executes it, and then shuts it down. Since it is so ephemeral it is completely stateless, but you can load files from S3 buckets if you need data of any sort. I assume you can also connect to a variety of AWS databases, although I haven't done this yet. If you need additional libraries or packages that are not included by default, you can create a layer containing them.
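To give an idea of what this looks like, here is a minimal sketch of a Python handler that loads a JSON file from S3 on each invocation. The bucket name, key, and the returned response shape are placeholders I made up for the example. The optional `s3_client` argument is my own addition so the function can be tried locally with a stand-in client. Lambda itself only passes `event` and `context`.

```python
import json

# Hypothetical names, used only for illustration.
BUCKET = "example-bucket"
KEY = "config/settings.json"


def lambda_handler(event, context, s3_client=None):
    """Load a JSON file from S3 and report which top-level keys it contains."""
    if s3_client is None:
        # Imported lazily so the module can be loaded without boto3 installed.
        import boto3
        s3_client = boto3.client("s3")

    # Nothing is kept between invocations, so the file is fetched every time.
    response = s3_client.get_object(Bucket=BUCKET, Key=KEY)
    settings = json.loads(response["Body"].read())

    return {
        "statusCode": 200,
        "body": json.dumps({"keys": sorted(settings)}),
    }
```

The function's execution role would need permission to read the object for this to work.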
Lambda is not going to replace servers for most use cases, but I think serverless technology is going to make quite a dent in the near future.