It appears that Hive, Impala, Pig, and others all provide SQL or SQL-like access to data stored on Hadoop clusters. They all seem to have support for HDFS, S3, and other forms.
So why are there so many different ways for accessing Hadoop information by SQL, how are they different, and how does their performance compare?
Do we have so many different versions because all of the projects were started at the same time for more or less the same reason? If so, is there an advantage to knowing more than one of them?
I have found several articles that attempt to explain the differences (e.g. 10 ways to query hadoop with SQL and Selecting the right SQL on Hadoop, but mostly they just list features.
0 Answers