NDBCluster Engine does not store to local disk. It pushes data to data nodes, which don’t need to be running MySQL.
Data nodes are ndbd processes and not necessarily bound to a single server. For testing, you can run multiple on the same server.
Data usually stored in 2 data nodes. Storing in 3 is experimental feature, requires more servers and performs worse. Storing in 1 obviously reduces availability, but gives great performance.
Storage is transactional.
You can have many mysqld instances all accessing the same data nodes. You can also use NDB API to access data nodes.
A hash is used to partition data across data nodes.
Always at least 1 replica of stored data. Node groups allows you to group replicas. Group selection is done according to order the server is listed in config file.
If a node goes down, its replica takes over as primary for the data. When node comes back, if not too far behind, it will sync the changes. If too far behind, it will sync a dump of all data.
Or you can bring up a replacement data node server with same IP address. Config is by IP.
If 2 data nodes in a group goes down, cluster must halt. Consistency over availability, CAP-wise.
SQL Node is a node running MySQL server.
Up to 48 data nodes. Very rarely more than 10 in practice, due to network and maintenance issues. Usually just a few.
Management node started first and it reads in the config. All data nodes need it in order to start. Also used for monitoring and arbitration.
If 2 data nodes can’t communicate, but 1 can still reach management node, that data node will take over. The one that can’t reach the mgmt node will shut down.
Mgmt node should be on different physical machine than the data nodes. You generally want two management nodes.
Only one will be the arbitrator. If primary fails, the data nodes will elect the 2nd mgmt node as the arbitrator.
There are also API nodes. A SQL node is a special case of an API node. API nodes use C++ library. Can also use Cluster/J in 7.1 for Java apps.
Only have to create tables on one MySQL server. The table creation will be auto detected by other servers.
User and grant info is independent, so that must be done on each server.
But, you can set up replication between MySQL servers. This allows you to more easily mix in non-cluster engine tables.
Need to handle retries in app. If data node dies during transaction, app must retry so transaction goes to other data node.
Table limits: 8052 bytes per row (can use blobs, but perf not good), 128 columns per table.
Indexed data limited to available RAM. Non-indexed data can be stored on disk.
Features: Online backup, global replication (can use MySQL to replicate clusters), NDB & MGM API, online additions of new data nodes.
Estimated query plan. Only for SELECT. Does not generate byte code. Unfortunately, you can’t ask MySQL for the plan it actually used for an executed query.
MySQL does everything as nested loop JOINs. Even a single table query is a single iteration loop. Always a left-deep tree, though it may reorder joins within the tree.
EXPLAIN output starts with deepest JOIN.
If no subquery or UNION, id for each row will be 1.
For select_type, ‘simple’ means only one SELECT keyword in entire query.
Derived means executed as a temp table, e.g., result of a subquery. If the table column is , the n is the subquery’s id column.
Union – rows spooled into temp table then read out with null in a row that says UNION RESULT
Be aware that EXPLAIN actually executes subqueries and unions, which can make it run very slowly. Also, can cause side effects if you update variables.
For derived, the higher id row is executed first. For UNION, the output tells you the execution order, e.g., .
IN and NOT IN queries are poorly optimized. They are converted into a dependent subquery.
UNCACHEABLE refers to the item cache. Not well documented. A few non-deterministic things like RAND() or a user variable can cause this.
Type is the JOIN type. Really more of an access type. Worst to best is ALL, index, range, ref, eq_ref, const, SYSTEM, null.
eq_ref means if there is a match, there will only be one row, perhaps due to a unique constraint on an index
Some non-range clauses get labeled as range because of details of the optimizer implementation.
ref tells you which columns or constants from preceding tables are used for lookups in the index named in the key column
rows is number of estimated rows that must be read. 5.1 tries to use LIMIT clause, but it is still sometimes inaccurate.
filtered: # of rows satisfying a condition. Added in 5.1. Not very useful.
extras: useful extra info (using index -> covering index; using where -> server post filters rows after using index; using temporary -> implicit temp table but not specified if memory or disk, using filesort -> means some kind of sort, but could be in memory)
filesort is an algorithm name. It doesn’t mean a disk-based file was used.
Use “pager mk-visual-explain” to pipe output through mk-visual-explain, assuming you installed Maatkit.
Standard driver is MySQLdb. Another option is oursql.
If using MySQLdb library – db.connect(use_unicode=True, charset=’UTF8′)
The oursql driver has same API. Uses mysqlclient. Also implemented in C.
Both depend on MySQL client already being installed.
MySQL Connector/Python doesn’t depend on MySQL client libraries. Written by Geert.
Connector/J can be used with Jython. from java.sql import DriverManager
Current version is 0.1.3, but 0.1.4 coming soon.
Performance for INSERTs is about twice as slow as for MySQLdb or oursql.
Need more benchmark tests, since not clear whether that was including the network roundtrip time, since that should be roughly the same.
After the talk I talked to Chromakode about drivers and he recommended oursql. Much better code and documentation. I’ve switched to oursql and I like it.
Better Database debugging for shorter downtimes – slides
Tracking changes over time
Slowly Changing Dimension type-2. Start with a baseline and only track changes.
Need PK of original object, maybe from information_schema. Select attributes to track.
Add new row for each new property and when properties change. Use current_flag column to indicate which version is current. Also use effective_date and end_date, which can also tell you the current version.
Job logging via stored procedures
When they run a job (e.g., crontab entries), they use stored procedures to track when and how long it ran.
Really need to review slides and the stored procedures to get much value from this.
Linux Performance Tuning and Stabilization Tips – slides
To minimize mem usage, use compact datatypes. Remove unused indexes. Archive data when possible to minimize your working set.
Use direct IO -> innodb_flush_method=O_DIRECT.
Don’t alloc too much memory to mysqld. Makes filesystem cache too small and causes swapping.
If you disable swap & run out of memory, OOM killer will likely kill mysqld. It takes a long time to kill it and it aborts the process, which triggers crash recovery.
MyISAM, backup tools & admin tools use the filesystem cache.
Copying large files can cause swap. This can happen during backup.
set vm.swappiness=0 in /etc/sysctl.conf. Default is 60. This makes OS prefer making filesystem cache small over swapping.
tcmalloc faster than malloc. Can config InnoDB to use tcmalloc. Mostly matters for CPU bound loads with large buffer pool.
Don’t use too much session memory This can happen if you make per session memory variables to big. read_buffer_size can cause large memory blocks to be allocated.
Use BBWC on raid cards. Then, disable write cache on disks.
Don’t set write barrier if using BBWC (in ext3, barrier = 0)
Appending+fsync much slower than updating+fsync
If you use sync-binlog=1, fsync is called on append. Very slow.
Set innodb_autoextend_increment to higher than default of 8, maybe use 20.
pdflush takes care of async writes for MyISAM, file copy, mysqldump, etc.
Facebook moved from ext3 to xfs. ext3 very slow to delete large files. Makes dropping large table slow.
Writing to file is serialized.
Use dir_index to speed up searching and disable write barrier.
With xfs, disable write barrier with “nobarrier”. Concurrent writes possible with O_DIRECT.
Fusion SSD cards are 2-3 times faster than Intel SSD cards and 10-20 times faster than HDD. Fusion cards really benefit from xfs.
I/O Scheduler default is cfq, but noop or deadline are often better. InnoDB is already doing I/O scheduling itself, so cfq results in useless overhead.
Big difference in write performance between cfq and noop when lots of random reads mixed with writes. Recommend noop.
I/O scheduler queue size. Default is 128. Setting large like 100000 can greatly improve perf for MyISAM, since it doesn’t sort writes. Doesn’t help InnoDB, since it already sorts them.
Check r/s, w/s, svctm, %util, though svctm more reliable than %util
mpstat gives per core stats, vmstat averages across cores
mpstat -P ALL 1
SystemTap can be used with MySQL binary builds with DWARF symbols