How To: Reduce data in shuffle stage for Spark SQL queries

Description:

Suppose we have a query like this:

select *
from (select *
      from SEGMENT_EVENTS_1MONTH e
      join SEGMENT_LOOKUP_VALUES_KL l
      on e.segment=l.segment_code and upper(l.segment_name) like 'NCS -%') a
left join query3_tbl_b b
on a.user_id=b.user_id

Because the join inside the subquery runs as part of the same query as the outer left join, this query shuffles a large amount of data. That can lead to resource problems on the cluster.
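One way to check where the shuffles occur is to prefix the query with explain and look for Exchange operators in the physical plan, since each Exchange corresponds to a shuffle. This is only a sketch; the plan you see depends on table sizes and join strategy settings.

```sql
-- Print the physical plan; each "Exchange" node is a shuffle.
explain
select *
from (select *
      from SEGMENT_EVENTS_1MONTH e
      join SEGMENT_LOOKUP_VALUES_KL l
      on e.segment=l.segment_code and upper(l.segment_name) like 'NCS -%') a
left join query3_tbl_b b
on a.user_id=b.user_id
```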

How To:

To reduce the shuffle data, we can first materialize the subquery result as a small table (query3_tbl_a here), and then rewrite the query to join against that table:

select *
from query3_tbl_a a
left join query3_tbl_b b
on a.user_id=b.user_id
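The small table query3_tbl_a has to be created before the rewritten query is run. A minimal sketch, assuming a create table ... as select statement is used for this; the using parquet clause and the use of select * are assumptions, not part of the original article:

```sql
-- Materialize the filtered join once as query3_tbl_a.
-- Assumption: the two source tables share no column names. If they do,
-- replace "select *" with an explicit column list, because the new
-- table cannot contain duplicate column names.
-- Assumption: parquet is an acceptable storage format for this table.
create table query3_tbl_a
using parquet
as
select *
from SEGMENT_EVENTS_1MONTH e
join SEGMENT_LOOKUP_VALUES_KL l
on e.segment=l.segment_code and upper(l.segment_name) like 'NCS -%'
```

Since the filter on segment_name is applied while the table is built, query3_tbl_a holds only the matching rows, and the later left join has less data to shuffle.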
