Questions and Answers
How many ways could you have a table with no data (0 rows) in Impala?
- CREATE TABLE and then don't do anything.
- CTAS that doesn't match any rows.
- Some data files but in unrecognized format. (Expect errors during query.)
- Data inserted outside Impala's control and you didn't REFRESH yet.
- CREATE EXTERNAL TABLE or ALTER TABLE ... SET LOCATION and there was nothing in the HDFS directory.
- Partitioned table with some ADD PARTITIONs; directories but no actual data files.
- Some files in the data directory but hidden by leading . or _ in their names.
Takeaway
- There are logical and physical aspects to the notion of data being in a table.
- The familiar notions of DELETE and TRUNCATE TABLE don't apply though.
Examples
make_table_empty.log
What is the smallest non-zero amount of data you could possibly put into an Impala table?
It depends.
- The smallest text data file you could create would be 2 bytes: a single character (such as a digit 0-9)
followed by a newline.
- The smallest data representation would be a Boolean value, but if you put that into a text file
it would be 'true' or 'false' plus a newline.
- If you put the Boolean value into a binary file, it would be encoded possibly into a compact form,
but the binary file format might have its own bytes of overhead.
- Even a NULL would be represented in text as \N, for a total of 3 characters (including newline).
Takeaway
- There is always a consideration of "size on disk" versus "size in memory".
- Small variations add up when you have billions of rows.
Examples
tiny_tiny_table.log
What are some of the ways a table can be "big"?
- Lots of rows.
- Lots of columns, i.e. "wide".
- Lots of partitions, i.e. "granular".
- Big values in some columns.
- Large total data volume.
- Large data volume in each partition.
- Many data files.
Takeaway
There are different considerations and different sweet spots for each.
- Consider partitioning. Choose efficient data types. Look at column characteristics to choose file format,
e.g. sparse values or good compressibility favor Parquet.
- Crunch data from selected columns? Parquet a good choice.
Look up all associated values for single row? HBase a good choice.
- Consider reducing granularity to get more data in each partition.
Consider dropping old partitions logically, via an external table, without necessarily getting rid of the underlying data.
Could potentially run into file descriptor limits during dynamic partitioned insert for Parquet.
- Sometimes Impala can read bigger values than it can insert or manipulate, e.g. up to 1 MB for strings.
- Look at the file sizes and try to find the sweet spot with # of files vs. # of HDFS blocks.
Consider Parquet.
Consider partitioning.
- If necessary, reduce granularity of partitioning columns.
Or consider compacting data in partition to find the sweet spot with # of files vs. # of HDFS blocks.
Examples
big_tables.log
How many ways could you use Impala to calculate 2+2?
- SELECT 2+2 -- No FROM clause needed; the alternative to Oracle's DUAL table.
- SELECT 2+2 FROM t1 -- but t1 better have exactly one row in it. Exactly like Oracle's DUAL table.
- SELECT 2+2 FROM t1 LIMIT 1 -- but t1 better have some data in it.
- Put 2,2 into a table, SELECT c1+c2 FROM t1.
- INSERT 2 rows, INSERT 2 rows, SELECT COUNT(*) FROM t1.
- Put 2 rows in a data file, SELECT COUNT(*) FROM a UNION ALL of the table with itself.
- Put 2 rows in a data file, SELECT COUNT(*) FROM a join of the table with itself.
- SELECT <query with a big result set> LIMIT 2+2;
- Make a column with a 2-character value, SELECT LENGTH(CONCAT(c1,c1)).
- Make columns with arbitrary length values, SELECT LENGTH(SUBSTR(c1,1,2)) + LENGTH(REGEXP_EXTRACT(c2,'^(..)',1)) FROM t1
WHERE LENGTH(c1) >= 2 AND LENGTH(c2) >= 2 LIMIT 1.
Takeaway
- Expressions can be used in lots of familiar SQL contexts.
- Learn the significance of clauses like LIMIT, INSERT OVERWRITE, UNION ALL, and
names of built-in functions.
Examples
two_plus_two.log
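A minimal sketch of a few of these variations, assuming a scratch table t1 with two INT columns:
CREATE TABLE t1 (c1 INT, c2 INT);
INSERT INTO t1 VALUES (2, 2), (2, 2);    -- two rows
SELECT 2+2;                              -- no FROM clause needed
SELECT c1 + c2 FROM t1 LIMIT 1;          -- 4
SELECT count(*) FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t1) AS doubled;  -- 4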
You have a table with N rows. How do you control how many rows are in the result set?
- LIMIT puts a cap on the size of the result set.
- WHERE makes the result set smaller; each additional AND condition narrows it further.
- No WHERE clause? Same number of results as input rows.
- Aggregation functions may condense the results.
- GROUP BY expands the results from aggregation functions (one row per group instead of a single row).
- UNION expands the result set (but no dupes); UNION ALL even more (allows dupes).
- JOIN can make the result set much smaller or much larger. Think through!
Takeaway
- You can always UNION, JOIN, WHERE, GROUP BY, and LIMIT your way to a bigger or smaller result set.
- LIMIT is a hard stop on number of results.
- DISTINCT, WHERE, aggregation functions and GROUP BY make the output smaller than the input.
- Joins can make things smaller or larger.
- The order of results isn't guaranteed without an ORDER BY.
Examples
ten_rows.log
You have a table with N columns. How do you control how many columns are in the result set?
- SELECT * produces N columns.
- You can name specific columns, in any order or subset.
- Also with aliases to give different names.
- You can intermix expressions, function calls, other things that aren't actual columns.
Takeaway
You can subset, rearrange, expressionize, and alias your way to any "shape" of a result set you want.
Combine with the techniques from the previous question to fine-tune your result set in both dimensions.
Examples
four_columns.log
For a table with a million (or a billion) rows, how much memory does it take to do SELECT c1 or SELECT COUNT(DISTINCT c1)?
- Depends on cardinality of c1, type of c1, average length of c1 values,
number of nodes involved in query, and number of cores per host.
- Only depends a little on file format.
- With more than one DISTINCT column, memory usage increases proportionally based on column size.
- Turn on APPX_COUNT_DISTINCT and the answer changes -- memory use is bounded but answer is approximate.
Takeaway
- Certain SQL constructs demand close attention in terms of scalability.
- Sometimes bigger table doesn't mean more memory overhead for queries.
- Use Impala for I/O-intensive queries, i.e. a lot of data read from disk.
Examples
count_distinct.log
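A minimal sketch of the exact-versus-approximate tradeoff, assuming a table t1 with a high-cardinality column c1:
SET APPX_COUNT_DISTINCT=true;
SELECT count(DISTINCT c1) FROM t1;       -- bounded memory, approximate answer (NDV-style estimate)
SET APPX_COUNT_DISTINCT=false;
SELECT count(DISTINCT c1) FROM t1;       -- exact answer, memory grows with the cardinality of c1
SELECT ndv(c1) FROM t1;                  -- the approximate function you can also call directly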
Why would you use Impala vs. Hive?
- They are complementary. Run Impala on Hive-created tables, or vice versa.
Impala is faster and less cryptic and Java-ish for lots of stuff. Run queries side-by-side to see.
- You use Hive for multi-hour critical parts of the ETL process. It saves to disk at each step.
It has some write/transform capabilities that Impala doesn't yet.
Then Impala can pick up the results and run analytic queries on the data.
- Some people have cut hours out of their ETL pipelines by switching some stages from Hive to Impala.
As Impala gets more built-in functions, file format writers, etc. the balance can shift more.
- Just think how many different SQL command lines there are on an Oracle Big Data Appliance.
Oracle, MySQL, PostgreSQL, Impala, Hive. More arrows in the quiver.
Takeaway
- We are talking about shifting the balance. Impala more for analytics, Hive more for ETL batch jobs.
- You still get Hive for free with CDH.
- Scope out the time savings.
Examples
Blog post
INSERT INTO t1 SELECT * FROM raw_data - What performance and scalability considerations are there for a copy operation like this?
- Logically, the end result is the same. Physically, it depends,
and that's where performance and scalability come into the picture.
- If destination is a text table, all the data gets expanded to string format and written out to text files.
Very simple and predictable, but slow to query due to bulky data and string-to-number/timestamp/Boolean conversions.
- Data could also be converted from Parquet back to text; one big file could be split into several;
many small files could be compacted into few big ones.
- If the destination is a Parquet table, all the data gets organized by column, encoded into binary format, then compressed,
and written out to files with bigger-than-normal block size. Slower and more memory-intensive to write,
but faster to query.
- The overall number of data files depends on the number of nodes that participate in the SELECT
portion because they hold data from the source table.
- If the destination is a partitioned table, one or more columns at the end of the SELECT * are represented as
subdirectories in HDFS, not stored in the data files. The number and sizes of data files is
determined by the cardinality of those columns. Queries should use those columns for filtering.
- If the destination is a partitioned Parquet table, it could be extra memory-intensive to buffer all those big HDFS
data blocks while constructing data for each individual partition. Important to have run COMPUTE STATS
on the source table RAW_DATA first.
Takeaway
File format and/or partitioning are important considerations for all write operations.
The characteristics of the destination table are very important in INSERT operations.
Examples
N/A, we'll see variations of file formats and partitioning in other examples.
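Still, a minimal sketch of the flavor of such copies, assuming hypothetical tables raw_data, sales_parquet, and a partitioned sales_part:
COMPUTE STATS raw_data;                  -- helps Impala plan the partitioned insert below
CREATE TABLE sales_parquet STORED AS PARQUET AS SELECT * FROM raw_data;
INSERT INTO sales_part PARTITION (year, month)
  SELECT id, amount, year, month FROM raw_data;   -- partition key columns last in the SELECT list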
Discuss what's really happening inside aggregation function calls. How does GROUP BY change things?
SELECT COUNT(c2), SUM(c2), MAX(c2), MIN(c2), AVG(c2), COUNT(DISTINCT c2) FROM t1;
SELECT c1, COUNT(c2), SUM(c2), MAX(c2), MIN(c2), AVG(c2), COUNT(DISTINCT c2) FROM t1
GROUP BY c1;
- What's happening internally for each of those function calls?
- How does addition of the GROUP BY clause change things?
- How does the number of hosts in the cluster affect the processing of these aggregate functions?
For queries without GROUP BY
- COUNT keeps a single counter value.
A BIGINT since you don't know in advance how many there could be.
- SUM keeps a single accumulator value.
A BIGINT or DOUBLE, since the sum could overflow the type of the actual column.
- MAX and MIN keep a single highest/lowest value.
The same type as the column itself.
- AVG keeps two intermediate values, equivalent to a SUM and a COUNT.
Then divides at the end and only returns the result of the division.
- COUNT(DISTINCT) keeps a hash table of all the different values.
Therefore strongly prefer to do on a numeric ID column rather than a string,
to keep memory usage within sensible bounds for high-cardinality columns.
For queries with GROUP BY
- COUNT keeps a single counter value for each different value of the GROUP BY column (c1).
- SUM keeps a single accumulator value for each different value of the GROUP BY column (c1).
- MAX and MIN keep a single highest/lowest value for each different value of the GROUP BY column (c1).
- AVG keeps two intermediate values for each GROUP BY value, equivalent to a SUM and a COUNT.
- There could be just a few of these intermediate counter, sum, etc. values (say if C1 was really STATE).
- Or there could be billions of these intermediate counter, sum, etc. values (say if C1 was a unique column).
The distribution and parallelization depend on how many nodes actually hold data relevant to the query:
- Each node computes a COUNT of some subset of the data, then all the counts get summed at the end.
- Each node computes a SUM of some subset of the data, then all the sums get summed at the end.
- Each node computes a MAX or MIN of some subset of the data, then the highest/lowest of these are chosen at the end.
- Each node computes intermediate COUNT and SUM of some subset of the data, then those intermediate values are
summed at the end and the final division is done.
Takeaway
Aggregation is a crucial aspect of Big Data queries. Function calls, DISTINCT, and GROUP BY all factor in.
Examples
aggregate_memory_usage.log,
kings_of_france.log
What are the implications of normalized vs. denormalized data?
- Normalized data, now you need IDs. You can finally synthesize those in Impala 2.x with analytic functions.
- Normalized data, consistent use of IDs could lead to lower memory usage during early stages of query. E.g. states, zip codes.
- Normalized data, you can avoid reading certain columns by not joining those fact tables.
- Denormalized data, you can pre-join to ensure local reads and avoid network traffic.
- Denormalized data, with Parquet you can avoid reading certain columns even though they're in the data files.
- Denormalized data, that might be the form it arrives in, so save conversion step.
- Denormalized data, you might have easier time filling up big blocks and avoiding small data files. (Depends on circumstance.)
Takeaway
- Physical aspects dominate the considerations for normalized / denormalized in Hadoop.
- Lean towards a denormalized layout using Parquet. Fewer joins, exploit column-oriented Parquet format.
Examples
normalize.log
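A rough sketch of synthesizing IDs for a normalized layout, assuming a hypothetical raw_data table with a city column:
CREATE TABLE city_dim STORED AS PARQUET AS
  SELECT dense_rank() OVER (ORDER BY city) AS city_id, city
  FROM (SELECT DISTINCT city FROM raw_data) AS distinct_cities;
-- the big fact table can then carry the compact city_id instead of the full string.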
What are some ways "big data" use cases differ from traditional database work?
Now we dig another level deep into the data, looking for insights and trends rather than simple answers.
- Old - Total up all sales for all stores for the past quarter.
- New - How much do we sell early morning, mid-day, late evening? Should we change opening/closing hours?
- Old - Which players are most valuable based on per-game stats?
- New - What trends are there minute-by-minute or play-by-play, based on each shot or player movement?
- Old - Does person X have enough money in bank account for this purchase? Did something happen just now to indicate fraud?
- New - Looking back at known OK or bad purchases, are there subtle trends that could help to detect future fraudulent activity?
Takeaway
- Look for ability to dive one level deeper / more granular, and generate insights that you wouldn't have thought of before.
- Take opportunity to add one more dimension - time, location, type, amount.
Examples
What is wrong with using a million rows of data? (O'Reilly blog post)
How would I synthesize some data to use?
- Generate text files with random or calibrated distributions and ranges.
- Shoot for billions of rows / tens of GB of data as a starting point.
- INSERT...SELECT variations of existing data back into the same table.
- Use ROW_NUMBER() to assign unique IDs, or DENSE_RANK() to assign category IDs.
- INSERT...SELECT into a different table to convert to Parquet.
- Use now() +/- INTERVAL to generate time sequences.
- Be careful doing INSERT...SELECT with 10 years of data into
partitioned Parquet table. COMPUTE STATS, consider using multiple
INSERTs with static partitioning, be ready to hint SHUFFLE.
Takeaway
Generate a realistic volume and distribution of data.
Convert/arrange into realistic layout.
"Realistic" is the key for all testing, prototyping, benchmarking.
Examples
synthesize.log,
multicols.py,
usa_cities.lst,
multicols.log
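One more hedged sketch of the copy-back-onto-itself style, assuming hypothetical events and raw_events tables:
INSERT INTO events SELECT id + 1000000, ts - INTERVAL 1 YEARS, val FROM events;   -- double the volume, shift the copies back a year
CREATE TABLE events_with_ids STORED AS PARQUET AS
  SELECT row_number() OVER (ORDER BY ts) AS id, ts, val FROM raw_events;           -- assign unique IDs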
What are the ramifications as cluster size increases?
- Special cases for cluster size 1, 2, 3.
- Cluster size 4, now you have choice of namenode on same machine as datanode or not.
- Cluster size 5+, now some blocks are not on all nodes.
- As cluster size increases:
- Chance of node failure during query increases, but that's not so bad, just retry.
- Chance of "1 slow node" increases.
- More of the cluster potentially sits idle during each query.
Needs higher volume / more data blocks to keep the cluster "busy".
You can get higher scalability without necessarily higher throughput for low-volume queries.
Takeaway
Bigger cluster helps for scalability. Different issues as cluster grows.
Examples
N/A
What are ramifications of cluster topology?
- Coordinator node always does a little more work than others.
- Thus round-robin the requests via a load-balancing proxy to spread out this extra load.
- The namenode has extra work to do to serve up the block metadata and do metastore stuff.
- Thus when you have a big enough cluster, don't use it as a datanode.
- But do co-locate the Impala statestore and catalog server there.
- One slow or low-powered node can drag down query performance.
- Because Impala divides up the work equally.
- E.g. a cluster mixing an 8 GB machine (even a laptop) with much bigger nodes; the small machine running out of memory would demolish query performance.
- The more nodes, the more "aggregate memory" there is for big aggregations and joins.
Takeaway
Prefer beefy nodes of uniform capacity.
Examples
N/A
What are the ramifications of compression?
- Less disk use, less network, but more CPU and potentially more memory.
- Hadoop typically has high-powered disks, big blocks;
so modest reduction in I/O not necessarily a big deal.
- Works better built into binary formats rather than applied to text.
- Gzipped text not splittable. Only supported in 2.x. Memory efficiency improved in 2.1.
- E.g. with Apache logs, maybe use only for initial ingest and then one-time convert to Parquet.
- Keep size of .gz files to < 128 MB, then the non-splittability wouldn't be a big deal.
- LZO text requires extra setup and an indexing step to make the files splittable.
- It's the icing on the cake with Parquet; can even turn it off and still get compact data.
- Can be applied or not or different codec on a file-by-file basis, not an inherent property of the table.
- Parquet + codec likely to be better compression than text + codec due to encoding and rearranging.
Takeaway
Revisit your assumptions from previous experience, and test with realistic data volumes.
Examples
N/A, we'll see compressed and uncompressed data size in other examples.
What does the ETL cycle look like?
- The "L" point is open to interpretation. Once data is on HDFS, the loading is essentially done.
- LOAD DATA, CREATE EXTERNAL TABLE, or alter LOCATION of table or partition.
- Often one final transform in Impala:
- Convert to Parquet.
- And/or partition the original data.
- Perhaps after some earlier transforms in Hive.
- Now with analytic functions, more opportunities to assign IDs.
- Recommend using Impala for simpler DDL where practical.
- Still early, best practices being developed.
Takeaway
Impala typically comes in near the end of the cycle.
Examples
N/A, we'll see different ways of bringing data into Impala in other examples.
How do you know a column is unique? No nulls? Only contains a specific set of values?
- No enforcement of constraints.
- Set up those relations earlier in the ETL cycle.
- Query if you want to be sure.
- Can examine partition key directories to spot nulls or unexpected values.
Takeaway
- You're working at large scale so the overhead of enforcement would be too high.
- At large scale, rely on statistical significance not perfect precision.
Examples
Some examples of checking assumptions in kings_of_france.log.
What's wrong with SELECT * FROM t1 in a Hadoop / Big Data context?
- You don't know in advance how many rows are in t1, could be zillions. Could have changed since last you checked.
- You don't know in advance how many columns are in t1, could be thousands. Could have changed since you last checked.
- In a table with billions of rows, collecting unnecessary columns for the result set could waste gigabytes of memory.
- In a table with billions of rows in Parquet format, reading unnecessary columns could waste gigabytes of I/O.
Takeaway
Some sloppy habits that you could get away with before, are harder to get away with in Hadoop.
- Use LIMIT, WHERE, aggregate functions.
- Use DESCRIBE to see all the columns. (Table definition might still change afterward.) Use views to define subsets of columns.
- Use EXPLAIN to see preliminary estimates of memory usage, SUMMARY to see report of actual memory usage.
- Use EXPLAIN to see preliminary estimates of I/O reads, SUMMARY to see report of actual I/O reads.
Pro tip: ALTER TABLE DROP COLUMN to get rid of unused columns from the end. Doesn't actually change data files!
Examples
N/A, we'll see lots of examples of LIMIT and specific column queries in other examples.
How do you recover if you made a mistake in your schema definition? (Schema evolution.)
- Table in wrong database (e.g. DEFAULT)? ALTER TABLE t1 RENAME TO db.t1;
- Wrong table name? ALTER TABLE t1 RENAME TO t2;
- Wrong column name? ALTER TABLE t1 CHANGE c1 c2 <same_type>;
- Wrong column type? ALTER TABLE t1 CHANGE c1 c1 <other_type>;
- Need an extra column? ALTER TABLE t1 ADD COLUMNS (c7 <type>, c8 <type>, ...);
- Don't need that last column? ALTER TABLE t1 DROP COLUMN c8;
- Completely different set of columns? ALTER TABLE t1 REPLACE COLUMNS (c1 <type>, c2 <type>, ...);
Takeaway
- DDL statements primarily change metadata. Underlying data files untouched.
- Impala doesn't require the # of columns to match between table definition and data files.
- Impala can synthesize NULLs for missing columns, ignore data from superfluous columns,
interpret some kinds of values as different types.
Examples
N/A, these DDL statements illustrated in various examples.
How do you deal with input text files in varied or inconsistent formats?
- Set delimiter character for simple delimited files.
- Rewrite to remove extraneous quotes etc. for over-ambitious text formats.
- Create a view to regularize over-ambitious text formats if you can't change original files.
- Then copy to a new table in "good" format as soon as possible.
- Create a view or copy to a new table to handle unexpected nulls, duplicates, other relational glitches.
Takeaway
Can handle troublesome data logically (with views) or physically (copying and rewriting).
Examples
text_tables.log
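A hedged sketch of the regularize-with-a-view idea, assuming a hypothetical messy_csv_raw table whose fields arrive wrapped in double quotes:
CREATE VIEW messy_csv_clean AS
  SELECT regexp_replace(c1, '^"|"$', '') AS c1,
         CAST(regexp_replace(c2, '^"|"$', '') AS INT) AS c2
  FROM messy_csv_raw;
CREATE TABLE tidy STORED AS PARQUET AS SELECT * FROM messy_csv_clean;  -- materialize in a clean format ASAP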
What are some trouble signs to look for in super-complicated queries?
- Some table missing stats.
- Excessive function calls. Maybe you could precompute in a new table, or collapse into a single UDF (e.g. CONCAT() of several substrings).
- Some function result that can't be turned into a constant or codegen'ed.
- If the function is on a partition key column, is it inhibiting partition pruning? Check via EXPLAIN.
- Scrutinize the innermost loop of any query with subqueries. Make it as selective as possible.
Takeaway
Little trouble spots can turn into big performance bottlenecks when the data volume is high.
Examples
N/A for this "Getting Started" tutorial.
What problems could arise if the data is too small?
- Misleading benchmark results if you scaled data down in an effort to produce a volume every system could handle.
- Too-small data in each partition, bottleneck in partition pruning, "many small files" problem. Prefer files close to HDFS block size of 128 MB!
- Not enough HDFS blocks = not enough parallelism to take advantage of big cluster.
- E.g. 10-node cluster, 9 blocks = idleness.
10-node cluster, 11 blocks = half speed.
10-node cluster, 10 blocks = danger of skew due to HDFS block placement.
10-node cluster, 100 blocks = better chance of even distribution.
- Small fixed startup costs like codegen eat up the benefits of parallel execution. SET DISABLE_CODEGEN=true when lots of tiny queries.
Takeaway
Use realistic data volumes. Look for use cases that exceed the capacity of other systems.
Examples
N/A
What are the ways data files could be laid out?
- Data files in local filesystem? No good, have to be in HDFS.
- Data files all in same directory in HDFS? Want to leave them there? CREATE EXTERNAL TABLE. No partitioning.
- Data files all in same directory in HDFS? Don't care if they move? CREATE TABLE + LOAD DATA. Still no partitioning.
- Data files all in same directory in HDFS? Want to copy and/or convert them for Impala?
CREATE EXTERNAL TABLE + (CREATE TABLE + INSERT...SELECT or CTAS).
- For any of these sets of files, they had better all be the same format but they could differ in size, name, compression.
Now we start getting into partitioning...
- Could take new set of files, leave in original location, via ALTER TABLE ... ADD PARTITION ... LOCATION.
- Could take new set of files, move from original location, via ALTER TABLE ... ADD PARTITION + LOAD DATA PARTITION (...).
- Could take new set of files, copy/convert from original location, via ALTER TABLE ... ADD PARTITION + INSERT ... PARTITION (...).
- You can specify all partition keys as constants (static partitioning), and fill in WHERE clauses if necessary for data you know a priori.
- The more partition keys you leave as non-constants in the PARTITION clause (dynamic partitioning), the more flexible but also the more resource-intensive the insert is.
- Different partitions could be different file formats, via ALTER TABLE ... PARTITION ... SET FILEFORMAT.
Takeaway
Lots of flexibility. The variations involve several different statements and clauses.
Examples
hdfs.log
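A hedged sketch of a couple of these layouts, with hypothetical paths and table names:
CREATE EXTERNAL TABLE logs_raw (line STRING)
  LOCATION '/user/etl/incoming/logs';                  -- leave existing files where they are
ALTER TABLE logs_part ADD PARTITION (year=2015, month=2)
  LOCATION '/user/etl/incoming/2015_02';               -- new batch of files stays in its original directory
LOAD DATA INPATH '/user/etl/staging/batch1'
  INTO TABLE logs_part PARTITION (year=2015, month=2); -- or move HDFS files into an Impala-managed partition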
What are the number ranges for Impala data types?
- You can deduce basic ranges of *INT types by knowing the types are 1-, 2-, 4-, and 8-byte signed integers.
- Plus and minus powers of 2, one bit saved for the sign, range a little farther on the negative side than the positive.
- Different behavior if out-of-range values read from data vs. overflowing during computation vs. CASTing.
- You can also call min_int() and max_int() functions to check programmatically.
- DECIMAL is all in terms of 9s depending on scale/precision.
- E.g. DECIMAL(5) = -99999 to 99999; DECIMAL(4,2) = -99.99 to 99.99. Max of 38 '9' digits.
- FLOAT and DOUBLE are IEEE-style. You need some familiarity with floating-point behavior from scientific computing to use them effectively.
- Beware SUM(), AVG() on huge numbers of FLOAT or DOUBLE values, or GROUP BY.
Takeaway
- Use smallest practical numeric data type for each field.
- Leave plenty of headroom for future data.
- Use DECIMAL for floating point when precision more important than pure speed.
Examples
numbers.log
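A quick sketch of checking the ranges interactively with the min/max functions mentioned above:
SELECT min_int(), max_int();             -- -2147483648 and 2147483647
SELECT min_tinyint(), max_bigint();
SELECT CAST(99999 AS DECIMAL(5)), CAST(99.99 AS DECIMAL(4,2));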
Isn't it great, I have this table with a million rows!
- A million rows isn't substantial enough to be interesting for Big Data. Unlikely to be >1 HDFS data block.
- A million rows is more like the "small" lookup table in a join operation. Like maybe every traffic light in the US
that would be cross-referenced against the much larger data set of every traffic accident, red light ticket,
cycle time, or speed measurement.
- Or it's like metadata about every book in a library system. You can count the most popular authors or
how many hardcovers there are vs. paperbacks vs. e-books, but you really want to cross-reference against
every lending event to gain new insights (which genres take shortest/longest time to read, are there
patterns in overdue fines).
- A million rows might tempt you into experimenting with partitioning or other techniques for scalability,
but again the volume is so small you won't learn anything useful.
Takeaway
Start with a billion rows, at least several GB, to learn anything useful about parallel execution on Hadoop.
Examples
What is wrong with using a million rows of data? (O'Reilly blog post)
How are joins different or special for Hadoop versus traditional databases?
- The computer that figures out what rows are applicable from table 1 isn't necessarily
the same computer that cross-references against the rows from table 2.
Data is distributed between hosts according to blocks, not necessarily entire partitions.
- The mechanism for coordinating how the data is read, processed, and transmitted is
different depending on which side of the join is bigger. Which could be "more rows",
"higher volume of data", "more rows matching the join condition and
the WHERE clauses".
- That's why COMPUTE STATS is so important.
- Maybe you denormalize your data so you don't have to do so many joins.
- Cut down the data being joined by joining results of a subquery or WITH clause.
- Cross joins could be a really bad idea, that's why you have to request them.
- And hints + STRAIGHT_JOIN if you still feel the join order isn't optimal.
Takeaway
This is a subject that is sort of familiar from traditional SQL, but
people rarely had to examine it in such depth before distributed
queries came along in Hadoop.
Examples
joins.log,
cross_join.log
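A rough sketch of nudging the planner when needed, assuming hypothetical tables big and small:
COMPUTE STATS big;                       -- the preferred fix: give the planner real numbers
COMPUTE STATS small;
SELECT STRAIGHT_JOIN b.id, s.name
  FROM big b JOIN [BROADCAST] small s ON b.id = s.id;
-- STRAIGHT_JOIN keeps the join order as written; the [BROADCAST] hint forces the small table to be broadcast.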
Where is my favorite feature from RDBMS X? Transactions, triggers, constraints...
- Some of these are more OLTP-oriented, dealing with 1 row at a time rather than billions of new inserted rows.
An RDBMS used for data warehousing would say to turn constraint checking and index maintenance off during data loading.
- Since Impala deals with data files prepared by others, integrity checking has to be pushed back farther in the ETL cycle.
- HDFS is append-only, so there is not the same easy undo as you get with rolling back a transaction.
- Big data analytics tends to have defined windows for data loading and then things are stable for a while.
If you are in a frequent-ingest scenario, that's beyond the scope of a "getting started" tutorial.
- Of course ETL gets more convenient and reliable over time, as new features get added...
- Forget about using EMPLOYEE and DEPARTMENTS tables for anything useful -- too small and OLTP-centric.
Takeaway
Unlearn any habits or mismatched expectations from OLTP days. Get into a business intelligence mindset.
Examples
N/A
How about indexes, where did those go?
- An RDBMS would say not to use indexes for queries that read more than 20% of a table.
Otherwise it's more work to seek back and forth in the index when you're reading all the data blocks anyway.
This is a very common scenario for big data analytics queries.
- HDFS blocks are so big, chances are you'll have to read all of them anyway.
- Partitioning is sort of like a single major index used by all queries for filtering based on equality and ranges.
- The "full table scan" isn't such a scary thing anymore. Move towards "full partition scan" to
minimize the I/O for a particular query.
Takeaway
Shift performance mindset towards BI-style full table scans and partitioning.
Examples
partition_keys.log
What things could trip you up with numbers?
- Arithmetic expressions might return a bigger type than you expect.
For example, 2+2+2+2 is a BIGINT and has to be cast to "fit" into INT or smaller columns.
(In some prior releases, 2+2 was also a BIGINT.)
- Floating-point values aren't represented precisely. Implications for SUM(), comparison, GROUP BY.
- DECIMAL is precise but slower than FLOAT or DOUBLE.
- Out-of-range integers bump up against the edges of the range.
- No unsigned types.
- Difference in conversion behavior for out-of-range or wrong format values between CAST() and reading from disk.
Takeaway
Do experiments and reading to validate all your prior assumptions about numeric types.
Examples
numbers.log
What things could trip you up with dates and times?
- Impala is relatively picky about the input format. (But has gotten more relaxed over time.)
- UTC vs. local timezone aspect causes glitches going back and forth with Hive.
- Impala TIMESTAMP dates go back as far as 1400.
- EXTRACT() and TRUNC() are relatively new additions.
- You rarely use a TIMESTAMP column itself as a partition key, usually only some other representation of it
(possibly split apart). Run the TIMESTAMP through TRUNC() and/or EXTRACT() to get repeated
values for grouping.
- You might regret turning TIMESTAMP into STRING to use as a partition key.
- You might regret doing GROUP BY on a TIMESTAMP column - too granular to be useful.
Takeaway
- Validate your prior database assumptions regarding dates and times.
- Be mindful of performance impact when dealing with individual date/time fields
in tables with billions of rows.
Examples
date_format.log,
partition_keys.log
What things could trip you up with strings?
- By default, no length enforcement.
- Don't recommend constrained VARCHAR or CHAR types for performance-critical stuff.
- Only ASCII is reliable for built-in functions like LENGTH().
- Impala can read a longer string from a data file than it can handle in INSERT.
- Potential interactions between long strings and Parquet block size.
Takeaway
- Take advantage of the unlimited-length STRING type where practical.
- But don't expect to work with giant document-length strings.
Examples
create_data_types.log,
insert_data_types.log
What things could trip you up with sorting?
- ORDER BY always results in some extra work at the end, imposing a delay on the transmission
of results back to you or your application. Be mindful when it's needed and when not.
- ORDER BY in a view definition doesn't carry over to a query against the view.
- ORDER BY in a subquery doesn't carry over to the outer query result.
- ORDER BY with LIMIT and/or OFFSET changes the result set and so gets preserved for correctness,
but still doesn't apply at the outermost level.
- Results that happen to be in sorted order based on layout of data file are not guaranteed.
(Especially when file > 1 block.)
- DISTINCT doesn't sort its results.
- Huge sorts could result in spilling to disk, slowing down overall query.
Takeaway
- Impala is conservative about when to do sorts - only when needed for correctness.
- Use ORDER BY consistently at outermost level.
Examples
sorting.log
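A small sketch of why the ORDER BY belongs at the outermost level, assuming a hypothetical table t1:
CREATE VIEW v1 AS SELECT c1 FROM t1 ORDER BY c1;  -- this ORDER BY is not preserved for queries on the view
SELECT c1 FROM v1;                                -- order of results not guaranteed
SELECT c1 FROM v1 ORDER BY c1;                    -- the sort that actually counts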
You enter SELECT COUNT(*) FROM t1 and get an "unknown table" error. Why could that be?
- Maybe you're in the wrong database. SELECT current_database(). USE. SHOW TABLES LIKE... SHOW TABLES IN...
SELECT with a fully qualified table name.
- Maybe somebody else dropped the table. No transactions, so a table could disappear out from under you.
- Maybe you created the table in Hive. INVALIDATE METADATA t1.
- Maybe your security settings don't let you see the table. Contact your administrator.
Takeaway
DDL in Impala has its own nuances to learn.
Examples
N/A; the basics of tables and databases illustrated elsewhere,
other conditions are relatively rare.
I run a query and it seems slow. How do I figure out what to do?
- Is the data big enough to be worth timing? Enough data blocks to parallelize?
- Look at EXPLAIN plan to see how much data is being read, see if that makes sense.
Maybe you'll learn you need to COMPUTE STATS or do some partition pruning.
- Figure out how much data is being read from disk, do back-of-the-envelope
calculations to figure out if the read rate or processing rate is out-of-whack.
E.g. expect to read about 100 MB / second for each disk.
Expect to process N rows per second per host for aggregation. (Keep in mind that only 1 core per host is used for that phase.)
Remember that you're expecting to process huge volumes of info, and the laws of physics apply
in terms of how many bytes per second can be read and analyzed.
- Look at SUMMARY.
See if some host taking much longer than average for a particular phase. (Suspect skew or problem with host.)
See if some phase taking the bulk of the time. See if anything jumps out at you about efficiency of that phase.
- Look at PROFILE.
See which phase(s) taking the longest on which host(s). Drill down into physical details of bytes read,
memory used, bytes transmitted.
- Read up on performance techniques: COMPUTE STATS, partitioning, Parquet file format.
- Understand single-threading implications for some operations (aggregations and joins),
versus multi-threaded scans.
Takeaway
There is an orderly process to follow for figuring out performance. You'll improve with practice.
Examples
explain_plans.log
How is X, Y, or Z represented in a text-format table? (Numbers, Booleans, date/times, NULLs.)
Variations
- What about different delimiters?
- Any support for optional enclosing quotes?
- Optional header line(s)?
Answers
- Do an INSERT and then take a look at the resulting text file.
- Verbose ROW FORMAT DELIMITED syntax in CREATE TABLE. ALTER TABLE ... SET SERDEPROPERTIES to change the delimiter afterward.
- No. Either do a view to strip them off, or copy to a table with enclosing quotes removed.
Not strictly "CSV" according to standard spec or custom serdes.
- No current support for skipping header lines.
Takeaway
- Text is the simplest format for ingestion. Simple, readable, easy to construct.
- But for data at scale, you'll want it in Parquet format by the time you start querying.
Examples
text_tables.log
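A hedged sketch of the verbose CREATE TABLE syntax, plus a quick way to see the on-disk representation (hypothetical table name):
CREATE TABLE csv_style (id INT, flag BOOLEAN, ts TIMESTAMP, comment STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;
INSERT INTO csv_style VALUES (1, true, now(), NULL);
-- then inspect the file under the table's HDFS directory: the NULL shows up as \N,
-- the Boolean as 'true', and the TIMESTAMP as a 'yyyy-MM-dd HH:mm:ss...' string.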
What could be bottlenecks slowing down a big query?
- Admission control. (Affects large and small queries equally.)
- A single node has some hardware or network problem slowing it down generally. (Less capacity, heavy load from non-Impala work, etc.)
- A single node is doing more work than others.
- Some complicated expression fools partition pruning because it isn't resolved to a constant.
- Some complicated function is called billions of times and can't be codegen'ed.
Takeaway
- Little inefficiencies become important when dealing with Big Data scale.
- Ensure sufficient / consistent capacity.
- Fold / simplify / materialize / denormalize where possible.
Examples
N/A
What are all the kinds of nonexistent or special values?
- There's NULL.
- The empty string is different from NULL. E.g. when you concat NULL with something else,
result is NULL; not so for empty string.
- In metastore, -1 signifies a missing value. E.g. in SHOW COLUMN STATS output.
- Integer values sometimes wrap to largest/smallest value of applicable type.
- Floats and doubles also have NaN and Inf.
Takeaway
There are some Hadoopisms in addition to the SQL NULL aspects.
Examples
special_values.log
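A tiny sketch of the NULL-versus-empty-string distinction:
SELECT concat('abc', NULL);                       -- NULL
SELECT concat('abc', '');                         -- 'abc'
SELECT length(''), length(CAST(NULL AS STRING));  -- 0 and NULL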
What are the differences between my dev/test system and a production environment?
- You might do some functional testing on a single-node system. Production will be a cluster.
- Production cluster will likely be bigger. Potential different bottlenecks.
- Production cluster might have lots of other databases and tables. Catalog scalability issues.
- Production cluster likely to have security turned on. You'll need credentials, get privileges
from administrator, might not have write access or even visibility into everything.
- Production cluster likely to have concurrent Impala users. Higher memory, CPU, disk, network utilization.
- Production cluster likely to have resource management and/or admission control. Queueing.
Potential failure for big queries.
- Production cluster likely to have concurrent non-Impala workloads. Memory limit considerations.
- Production cluster likely to have more frequent / structured ETL.
You'll access files produced by others, others will access your files.
Possibly need to REFRESH periodically.
Takeaway
There's a whole other course's worth for production considerations.
Examples
N/A
Which of the following are OK? (Quoting identifiers.)
Questions
SELECT abc_def from xyz_123;
SELECT abc-def from xyz-123;
SELECT 'abc_def' from 'xyz_123';
SELECT "abc_def" from "xyz_123";
SELECT `abc_def` from `xyz_123`;
SELECT `abc_def` from `db_name.xyz_123`;
SELECT `abc_def` from `db_name`.`xyz_123`;
SELECT a#c from 123_xyz;
SELECT `a#c` from `123_xyz`;
Answers
- OK.
- Fails, because of dash in table name. abc-def could be a column expression.
- Fails.
- Fails.
- OK.
- Fails because the dot separator is inside the backticks.
- OK.
- Fails.
- Fails. Table name can be created with backticks, column name can't be.
Takeaway
Practice with single quotes, double quotes, backticks.
Examples
quoting_identifiers.log
Which of the following are OK? (Quoting literals.)
Questions
SELECT c1 FROM t1 WHERE c2 = "hello world";
SELECT c1 FROM t1 WHERE c2 = 'hello world';
SELECT c1 FROM t1 WHERE c2 = `hello world`;
SELECT c1 FROM t1 WHERE c2 = hello;
SELECT c1 FROM t1 WHERE c2 = hello world;
Answers
- OK.
- OK.
- Fails.
- Possibly OK but misleading - hello treated as a column name.
- Fails.
Takeaway
- Practice with single quotes, double quotes, backticks.
- Single quotes and double quotes work for strings; backticks don't.
Examples
quoting_literals.log
Which of the following are OK? (Aliasing identifiers.)
Questions
SELECT foo bar FROM baz bletch;
SELECT foo "bar" FROM baz "bletch";
SELECT foo `bar` FROM baz `bletch`;
SELECT foo AS column 1 FROM baz AS table 1;
SELECT foo AS 'column 1' FROM baz AS 'table 1';
SELECT foo AS "column 1" FROM baz AS "table 1";
SELECT foo AS `column 1` FROM baz AS `table 1`;
SELECT foo AS `column 1` FROM `database 1`.baz AS `table 1`;
SELECT * FROM (SELECT MAX(c1) AS biggest FROM t1);
SELECT * FROM (SELECT MAX(c1) FROM t1) AS biggest;
Answers
- OK.
- OK.
- OK.
- Fails. COLUMN and TABLE are reserved words, and an unquoted alias can't contain a space.
- OK.
- OK.
- OK.
- Fails. You can't have a real database or table name with a space.
- Fails. Subquery must be given an alias.
- OK since subquery has an alias.
Takeaway
- Practice with single quotes, double quotes, backticks,
and aliases for columns, tables, and subqueries.
- Impala is fairly flexible about how to quote aliases.
Examples
aliasing.log
Which of the following are OK? (Reserved words.)
Questions
SELECT c1 FROM t1;
SELECT c1 FROM insert;
SELECT c1 FROM "insert";
SELECT c1 FROM 'insert';
SELECT c1 FROM `insert`;
SELECT c1 insert FROM t1;
SELECT c1 AS insert FROM t1;
SELECT c1 FROM data;
SELECT c1 AS as FROM t1;
SELECT c1 FROM lower;
SELECT c1 FROM max;
SELECT `having` AS `group` FROM `insert`.`distinct`;
Answers
- OK.
- Fails, reserved word requires backticks.
- Fails, reserved word requires backticks.
- Fails, reserved word requires backticks.
- OK.
- Fails, reserved word requires backticks.
- Fails, reserved word requires backticks.
- Fails, reserved word requires backticks.
- Fails, reserved word requires backticks.
- OK, built-in function names are not reserved words.
- OK, aggregate function names are not reserved words.
- OK, can escape reserved words in column names, aliases, database names, and table names.
Takeaway
- Keep an eye out for reserved words, when porting and upgrading.
- Backticks are the only kind of quotes that work for reserved words.
Examples
reserved_words.log
In what circumstances are NULLs allowed in the schema definition?
- All columns are nullable.
- NULLs in text data files are represented as \N, not empty values or the string 'null'.
- Newly added columns are NULL for all data files that don't have them.
- Avro schemas need to define all columns as nullable.
- There is a full set of nvl(), zeroifnull(), coalesce(), etc. functions for handling NULL values
and substituting or skipping as appropriate.
Takeaway
The flexible nature of Impala schema definitions means that you can't practically enforce
NOT NULL conditions. Design with the expectations that NULLs might show up anywhere.
Examples
nulls.log
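A small sketch of those NULL-handling functions, assuming a hypothetical table t1 with columns c1 and c2:
SELECT nvl(c1, 0) FROM t1;               -- substitute 0 when c1 is NULL
SELECT zeroifnull(c2) FROM t1;
SELECT coalesce(c1, c2, 0) FROM t1;      -- first non-NULL argument wins
SELECT count(*) FROM t1 WHERE c1 IS NULL;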
Where can you find Impala?
- The starting point for all things Impala is the Impala.io web site.
- You can try it free on Cloudera Live with no installation.
- There's the QuickStart VM for single-node experimentation.
- It comes with any CDH 5 installation.
- You can download it for CDH 4, but that's getting long in the tooth. Stick with CDH 5.
- You can spin up instances under Amazon EMR.
- You get it as part of Oracle Big Data Appliance.
- MapR includes it in their distribution.
Takeaway
You can try out the functionality in lots of places. Some require more serious HW / SW commitment than others.
Examples
Some URLs to browse...
Cloudera Live
http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-live.html
(requires signup)
Read-only demo: http://go.cloudera.com/cloudera-live.html
(see link at bottom to skip this signup page)
QuickStart VM
Docs: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cloudera_quickstart_vm.html
Download: http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html
CDH Distro
http://www.cloudera.com/content/cloudera/en/downloads/cdh/cdh-5-3-1.html
Amazon EMR
Announcement: http://aws.amazon.com/about-aws/whats-new/2013/12/12/announcing-support-for-impala-with-amazon-elastic-mapreduce/
Docs: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-impala.html
Tutorial: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/query-impala.html
Oracle Big Data Appliance
http://www.cloudera.com/content/cloudera/en/solutions/partner/Oracle.html
http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdataappliance-datasheet-1883358.pdf
MapR Impala docs
http://doc.mapr.com/display/MapR/Impala
How many ways are there to get a command prompt for Impala?
- If you're logged onto any DataNode: impala-shell
- If the long prompt bugs you: impala-shell -i localhost
- If you're logged onto a non-Impala host: impala-shell -i <datanode>[:21000]
- Use Hue.
- Some people use Beeline?!
Takeaway
Once you know the hostname or IP, and optionally port, of any DataNode, you're ready to roll.
Examples
N/A. We'll see lots of impala-shell in other examples.
What do you need to know to get around in impala-shell?
- End every command with a semicolon. Even quit.
- Long commands, break across lines.
Be ultra-careful though about mismatched quotes.
- Up/down arrow to cycle through past commands.
- Left/right arrow to cursor around and edit previous commands.
- Readline shortcuts like Ctrl-A, Ctrl-E = beginning/end of line.
Alt-B, Alt-F = backward/forward one word. (http://www.bigsmoke.us/readline/shortcuts)
- Alt-click on Mac to make cursor jump, but only within same line.
Takeaway
impala-shell is pretty convenient for interactive experimenting and iterating.
Just watch out for semicolons and quotes.
Examples
N/A. Try running and then modifying some CREATE/DROP TABLE statements.
What kinds of data types can I use in tables?
- Integers - CREATE TABLE int_t (t TINYINT, s SMALLINT, i INT, b BIGINT, dec DECIMAL(10));
- Floating-point - CREATE TABLE fp_t (f FLOAT, d DOUBLE, dec DECIMAL(5,3));
- Strings - CREATE TABLE str_t (s STRING, v VARCHAR(64), c CHAR(5));
- Dates/times - CREATE TABLE temp_t (ts TIMESTAMP);
- Booleans - CREATE TABLE bool_t (b BOOLEAN);
Examples
create_data_types.log
How do I get data into a table?
- INSERT...VALUES statement.
- Watch out! INSERT...VALUES not scalable or performant on HDFS.
- In real life, mainly use INSERT...SELECT to copy table-to-table, with expressions, functions, and CAST() as necessary.
- Also see LOAD DATA and HDFS-level ETL techniques in other examples.
- Integers - literals; arithmetic expressions + CAST() if necessary; function results + CAST() if necessary.
- Floating-point - literals; arithmetic expressions + CAST() if necessary; function results + CAST() if necessary.
- Strings - literals; function results + CAST() if necessary.
- Dates/times - string literals in right format; function results, sometimes chained.
- Booleans - true / false literals; assertion expression results.
Examples
insert_data_types.log
How do I make a table be in a specific database?
- CREATE DATABASE IF NOT EXISTS db_name;
- USE db_name; CREATE TABLE tbl_name ...;
- CREATE TABLE db_name.tbl_name ...;
- CREATE TABLE tbl_name ... ; ALTER TABLE tbl_name RENAME TO db_name.tbl_name;
- impala-shell -d db_name -q 'CREATE TABLE tbl_name ...'
Examples
table_in_db.log
When do I have to cast data types?
- More often than you might expect.
Impala will automatically up-cast, but not implicitly down-cast to avoid losing precision.
- 2+2+2+2 produces a BIGINT, so must be CAST() to a smaller type in e.g. INSERT.
- Source table might have bigger types than dest table; CAST() during INSERT ... SELECT.
- When inserting multiple values, the presence of a certain type forces all values for that column
to be interpreted as that type.
- If you don't know what type an expression returns, use it in a CTAS, then DESCRIBE the new table!
Examples
casting.log
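For example, a minimal sketch of the CTAS trick for discovering an expression's type:
CREATE TABLE type_check AS SELECT 2+2 AS total, now() + INTERVAL 3 DAYS AS later;
DESCRIBE type_check;                     -- shows the BIGINT and TIMESTAMP column types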
How do I empty out an existing table?
- No DELETE or TRUNCATE TABLE statements...
- INSERT OVERWRITE <table_name> SELECT <query matching no rows>
- INSERT OVERWRITE <table_name> SELECT <query> -- Empties + puts new data in.
- 'hdfs dfs' operations to remove HDFS files + REFRESH <table_name>.
- ALTER TABLE <table_name> SET LOCATION '/hdfs/directory/with/no/data/files'
- ALTER TABLE SET FILEFORMAT <different_format_than_the_data_files> -- won't be able to see any rows
Examples
make_table_empty.log
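A quick sketch of the INSERT OVERWRITE approach, assuming a hypothetical staging table with matching columns:
INSERT OVERWRITE t1 SELECT * FROM staging WHERE 1 = 0;  -- query matches no rows, leaves t1 empty
INSERT OVERWRITE t1 SELECT * FROM staging;              -- empties t1 and loads new data in one step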
How do I work with different date formats?
- SELECT now() to see format for current date and time.
- <timestamp> +/- INTERVAL n [DAYS|HOURS|YEARS|etc] to produce delta values.
- INSERT now()-themed values into a text table, examine the output file.
- Combination of unix_timestamp() to produce a number, then from_unixtime() to produce a formatted string.
Examples
date_format.log,
Impala docs
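A small sketch of the round trip (format strings follow the unix_timestamp()/from_unixtime() conventions):
SELECT now();
SELECT now() + INTERVAL 3 DAYS, now() - INTERVAL 6 HOURS;
SELECT from_unixtime(unix_timestamp('02/18/2015', 'MM/dd/yyyy'), 'yyyy-MM-dd');  -- '2015-02-18'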
You have data with dates. How can you partition by year, month, and day?
- If you have a finer-grained date/time value, you'll have to keep that and synthesize a new column
with just the date to use as the partition key. Hopefully you know the date for all data files
already based on the ETL pipeline.
- Represent the whole thing as a single STRING column like '2015-02-18'.
Bad idea - prefer numbers for partition keys, rather than strings.
- Represent the whole thing as a single INT column like 20150218.
Not bad. You can zero in on a particular day or range and
Impala will do partition pruning.
- Represent the whole thing as a single TIMESTAMP column like '2015-02-18'.
Unusual but has potential. You would need to be careful to truncate each value to avoid
a huge number of different partitions (for each different hour/second/fractional second).
- Split the individual fields into separate STRING columns like '2015','02','18'.
Bad idea. 3x the overhead of using a string as a partition key. Lots of string operations
like concatenation to match individual dates or ranges in the WHERE clause.
Compatible with Hive in terms of leading zeros in month/day directory names like '05'.
- Split the individual fields into separate SMALLINT and TINYINT values like 2015,2,18.
Good performance, minimal memory usage compared to STRING. (SMALLINT and TINYINT use less memory than INT?)
Better able to use dynamic partitioning, e.g. specify year and month in INSERT...PARTITION but leave day blank.
Easier to filter, e.g. conditions like 'WHERE month IN (11,12,1)'.
Three levels of directories to manage in HDFS. Potential mismatch with Hive directory names
which include leading zeros.
Takeaway
Choice of partition key type and granularity is an important decision to think through
and test performance and scalability with realistic data volumes.
Examples
partition_keys.log
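A hedged sketch of the separate-integer-columns layout from the last bullet, with hypothetical table names:
CREATE TABLE events (id BIGINT, ts TIMESTAMP, val DOUBLE)
  PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT)
  STORED AS PARQUET;
INSERT INTO events PARTITION (year=2015, month=2, day)             -- static year/month, dynamic day
  SELECT id, ts, val, CAST(extract(ts, 'day') AS TINYINT) FROM raw_events;
SELECT count(*) FROM events WHERE year = 2015 AND month IN (1, 2); -- filter that allows partition pruning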
How can I interface with Impala data through other Hadoop tools?
- DESCRIBE FORMATTED <table> always shows you the HDFS directory where table data lives.
- If you add or delete data in HDFS outside of Impala, run REFRESH <table_name>.
- Use LOAD DATA to move externally created files into an Impala table (starting from HDFS).
- Use CREATE EXTERNAL TABLE to point to data files on HDFS outside Impala's control.
- Use ALTER TABLE ... SET LOCATION to point to a different HDFS path at any time.
Often useful to point a single partition to a specific path.
Takeaway
It is easy to switch back and forth between Impala and other Hadoop tools,
and interchange data.
Examples
hdfs.log,
text_tables.log
How do I script Impala from the shell?
- impala-shell -d <database> to start or execute in a particular database.
- impala-shell -q '<query>' to run a single statement.
- impala-shell -f '<file>' to run a sequence of statements, or feed SQL code to standard input.
- impala-shell --quiet to suppress messages for query timing and echoed commands.
- impala-shell -B to produce delimited result sets for queries rather than pretty-printed tables.
- Very important for avoiding impala-shell overhead during benchmarks with big result sets.
- Embed a HERE document to do variable substitution in the SQL code.
- Use ! in impala-shell to shell out to a Linux command.
Examples
A HERE document...
#!/bin/bash
export DB_NAME=tpc
export TABLE_NAME=customer_address
export CRITERIA=Oakland
export CUTOFF=20
impala-shell -B -d $DB_NAME <<EOF
select * from $DB_NAME.$TABLE_NAME where ca_city = '$CRITERIA' limit $CUTOFF;
EOF
What are all the ways I can understand the performance and scalability of a query?
- EXPLAIN <query> to get plan and estimates ahead of time.
Lets you avoid embarrassment of accidentally running a
resource hog query that bogs down the cluster.
- SET EXPLAIN_LEVEL=<level> to see more or less detail.
- SUMMARY to get a concise view of the phases, to look for unusually high timing or
memory usage, or outliers.
- PROFILE to get a detailed view of everything that happened on every node.
Takeaway
You can choose whichever level of detail you need to feel comfortable
or do tuning and troubleshooting.
Examples
explain_plans.log
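The typical impala-shell flow, as a rough sketch with a hypothetical query:
SET EXPLAIN_LEVEL=extended;
EXPLAIN SELECT c1, count(*) FROM big_table GROUP BY c1;  -- check the estimates before running
SELECT c1, count(*) FROM big_table GROUP BY c1;
SUMMARY;                                                 -- concise per-phase timings and memory
PROFILE;                                                 -- the full per-node detail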
What analytic functions does Impala have?
- DENSE_RANK, FIRST_VALUE, LAG, LAST_VALUE, LEAD, RANK, ROW_NUMBER.
- OVER and window clauses. Limited RANGE support in window clauses (current row + unbounded in one or both directions).
- If you use these for scenarios like time series, stock ticker, etc. you probably already have them in your code.
- Use ROW_NUMBER and DENSE_RANK to synthesize extra columns as unique IDs or category IDs.
- These function calls are rewritten internally as join queries.
Takeaway
This is a relatively new Impala feature that opens up new use cases. Dive in and try these functions.
Examples
normalize.log,
Impala docs
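A tiny sketch of the windowing style, assuming a hypothetical stock ticker table:
SELECT symbol, trade_ts, price,
       lag(price) OVER (PARTITION BY symbol ORDER BY trade_ts) AS prev_price,
       row_number() OVER (PARTITION BY symbol ORDER BY trade_ts) AS seq
FROM ticks;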
How do you approach a table you don't understand very well?
- First you find it: SHOW TABLES, SHOW TABLES IN <db>, SHOW TABLES IN <db> LIKE '*wildcarded_name*'.
- Then you take a look at its structure: DESC <table>.
- Then you look a little deeper at its innards: DESC FORMATTED <table>.
- You probe the physical data files with SHOW TABLE STATS.
- If it's partitioned, you examine the partitions with SHOW PARTITIONS.
(Actually, this is the same info as SHOW TABLE STATS.)
- You see if the table has been analyzed with SHOW COLUMN STATS.
- Knowing the volume of data in the table, you gently start probing its contents.
(Lots of LIMIT clauses and NDV functions.)
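For example, a typical first pass might look like this (hypothetical names):
SHOW TABLES IN sales_db LIKE '*order*';
DESCRIBE sales_db.web_orders;
DESCRIBE FORMATTED sales_db.web_orders;
SHOW TABLE STATS sales_db.web_orders;
SHOW COLUMN STATS sales_db.web_orders;
SELECT * FROM sales_db.web_orders LIMIT 5;
SELECT ndv(customer_id) FROM sales_db.web_orders;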
How many databases are there in your Impala instance?
- All instances start out with 2, DEFAULT and _IMPALA_BUILTINS.
- See all the names with SHOW DATABASES.
- If it's a long list, find a particular one with SHOW DATABASES LIKE '*wildcarded_name*'.
Takeaway
Impala has a flat database structure, not divided by owner. There's usually a sweet spot between a single overstuffed database and a bazillion tiny ones.
Why could you access a column in a table today but not tomorrow?
- Someone could have done ALTER TABLE ... DROP COLUMN so the column no longer exists.
- Someone could have done ALTER TABLE ... CHANGE <old_name> <new_name> <type>
and now the column has a different name.
- There could be a new Impala release and now <col_name> is a reserved word.
(Escape it with backticks to work around.)
- What you think of as <table_name> could really be a view,
and the view was redefined to access fewer columns.
Sometimes views are used like this for security, to
restrict access to sensitive columns in a table you
can't query directly.
Takeaway
Hadoop generally is lenient about schema evolution.
Impala follows in that same spirit.
Be prepared for names and types to change.
What are the considerations for file formats?
- Text is the easiest and most flexible (hence the default), but also the bulkiest.
- Compressed text workable, but not generally a good idea due to non-splittability.
- Avro, RCFile, SequenceFile all queryable by Impala. More compact than text
but still not ideal for data warehouse-style workloads.
- Parquet the preferred format for real-world scenarios:
- Columnar store efficient for analytic queries.
- Lots of encoding and compression for compactness.
- Widely adopted, supported by other Hadoop components.
Takeaway
Divide attention between text for ease of ingest and learning with tiny tables,
and Parquet for anything realistic and scalable.
Examples
file_formats.log,
Impala docs
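A final hedged sketch of the usual text-to-Parquet progression, with hypothetical table names:
CREATE TABLE staging_txt (id BIGINT, name STRING, amount DECIMAL(9,2))
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- LOAD DATA or hdfs dfs -put the raw files into staging_txt's directory, then convert:
CREATE TABLE final_parquet STORED AS PARQUET AS SELECT * FROM staging_txt;
COMPUTE STATS final_parquet;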