If it wasn't clear from my other blogs, I LOVE polybase. It's a super powerful tool that seamlessly integrates flat-file parallelism with relational querying. However - there are some ways that the implementation really has to improve - These are true as of June 2017 - given how fast things change, I'm hoping they won't be limitations for too much longer…
This one is a biggie - if you're querying over a whole range of flat files that are organised into [YEAR]/[MONTH] folders, for example, you should be able to write a query like the following:
SELECT * FROM dbo.MyExternalTable WHERE [YEAR] > 2016
This filter would be pushed down to the polybase engine and tell it to ignore any files that have been vertically partitioned outside of our chosen range. Instead, all files are read and returned to the SQL Server engine and the filtering is done in-memory on the returned dataset. This is obviously hugely inefficient in some cases - especially when you're using Polybase as a data loading mechanism. This feature is available in HIVE tables and you can do it in U-SQL - hopefully it's only a matter of time before a similar feature is implemented here.
As a BI Developer, I love metadata - I want to know where a record came from, which flat file it was carried into the Lake aboard. The ability to query from thousands of files at once recursively is insanely useful - it gives you the ability to keep delta files as an immutable raw layer that you can query as a single entity. However, there is currently no ability to include the "filename" within the columns returned by the external table.
Again, this is something that you can do in equivalent technologies and seems mad that it's not included by default. When you look at the DMVs underneath the external readers, you can see, in detail, the file that each thread is currently reading, it's very aware of the file that it's pulling data from.
Just query the sys.dm_exec_external_work table whilst running a query over a large set of files in an external table and you'll see what I mean.
This one is even more frustrating - Polybase has a really useful feature where you can tell it to reject rows that fail to be parsed into the data types you've specified.
The most common usage for this is around file headers, discussed in the next limitation. However - the reject functionality has two flaws:
Firstly - you can't see the rows that failed. Such a simple thing, but if you're using Polybase as a data loading tool for an analytics warehouse, you can't accept that some rows simply won't make it through and you'll never know which rows failed.
This is a tricky limitation however - I have no idea how they would model this effectively within SQLDW. If we were in the standard SQL engine, I'd like to see External Tables to have an option to automatically have a shadow "External Error Rows" table created. Any row that fails parsing gets dumped into that table as a JSON record with a query date and object_id.
I've actually raised the idea on connect here: http://bit.ly/2rWT35D
There is also the issue of speed - I've seen real performance problems when having rejection percentages turned on, the continuous checking against tolerance seems to be a big drain on execution.
Header Row Skipping
Related to the above issue is Header Row skipping. Such a common thing to do, available in U-SQL, HIVE and ETL tools going back to the dawn of time… but Polybase cannot currently skip header rows. Preferably this would give you the option of skipping x rows of the file for situations where there is a metadata/header section at the top of your file.
Currently, you either have to accept your fate and perform some pre-processing on files to strip out and header data, maybe by using U-SQL or accept the performance overhead of having a reject rows tolerance set within your external tables.
None of the above are deal-breakers for me - the sheer speed and opportunities opened up by being able to query over a huge flat-file store, whilst still within your relational engine more than makes up for the limitations. However many of the limitations don't exist within similar tools - this at least gives me hope that they can be fixed and we should be seeing improvements over time.
Unfortunately, for the time being the focus seems to be on increasing Polybase compatibility - adding further data stores and additional security protocols rather than improving the functionality of the tool itself.
In the next post, I want to share some design patterns about Polybase use cases – dynamically switching external locations, wider architectures and more.