Thursday, June 05, 2014
Elasticsearch Architecture Design - Considerations for using Elasticsearch in custom solutions
Elasticsearch has its own set of use cases for which one may want to include it in a solution's technology stack. When evaluating Elasticsearch for a solution, there are certain aspects of the product that should be kept in view from a solution design perspective.
Below is a list of some of the most important architecture design considerations from my perspective.
1) Elasticsearch is a document-oriented database system. Document-oriented stores are useful when the system must scale and queries are driven by the content of the data rather than by a key, as is the case in key-value stores.
2) Elasticsearch stores data in the form of JSON documents. JSON has a limited set of data types. So if the source system is strongly and richly typed, one should plan the data migration and storage strategy carefully.
3) Elasticsearch is a database system that exposes REST-based APIs for data access as well as server administration. This means that Elasticsearch can be seen as a DB server as well as a web server. So when the infrastructure landscape is being designed, one should carefully consider whether to place Elasticsearch in the web zone or in the DB zone.
4) Elasticsearch does not ship with any authentication module as of this writing, though some community plugins are available. With this in view, one would want to keep Elasticsearch behind a web application and not expose it directly to the internet.
5) Elasticsearch does not ship with any GUI tools or editors for development purposes. Instead, it has a concept of plugins, which can be developed using the Elasticsearch APIs. A large number of web-based plugins for Elasticsearch are available on GitHub, and they can be easily installed and used for development and data analysis.
6) Elasticsearch client wrappers are quite popular among developers because they provide familiar syntax and reduce the need to learn the Elasticsearch API directly. For example, NEST is a .NET wrapper on top of the Elasticsearch API. The risk with such wrappers is that they must be updated continuously to stay compatible with the Elasticsearch API. Elasticsearch has a very aggressive release schedule, typically with at least 3-4 releases of the product every year, some of which contain breaking changes.
7) Elasticsearch, like many other NoSQL products, applies default behavior and automated data management unless it is explicitly overridden. For example, if a document is inserted with Id 1, and a document with the same Id is then inserted again into the same index and type, Elasticsearch won't raise an exception the way a relational database would on a duplicate key. It silently replaces the document and increments its version number. As another example, Elasticsearch decides which shard holds a document, and that shard may reside on any node. So unless placement is explicitly controlled, any document can end up on any machine, node and shard.
8) Elasticsearch versions every document, in the manner of a Multi Version Concurrency Control (MVCC) system. This means that a document is never actually updated in place. Whenever an update request is received, Elasticsearch writes a new copy of the document with an incremented version number and marks the old copy as deleted. Older copies are purged when segments are merged, and this can be tuned, but this characteristic of the system can still result in high storage requirements if updates are large and frequent.
9) Elasticsearch has a concept of "rivers" for integrating with external systems like CouchDB, Twitter etc. But out of the box, it does not provide any river to import data into Elasticsearch from SQL Server, Oracle, MySQL, DB2 and other relational databases. There are community plugins like JDBC River which can help with one-time imports of data. But for continuous extraction and loading of data into Elasticsearch, a distributed pipeline built on tools like Apache Kafka or Twitter Storm would be advisable.
10) Elasticsearch has some very rigid settings related to data management, which should be decided before the system is created. For example, when an index is created, it is configured with 5 primary shards by default. Once the index exists, the number of primary shards cannot be changed for the life of that index. So if you planned for 50 million records across 5 shards, and the requirements change so that you now need to load 150 million, you would still have to manage those 150 million records with only 5 shards, unless you create a new index and reindex the data.
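To make point 2 concrete, here is a minimal Python sketch (standard library only) showing that values such as dates and fixed-precision decimals have no native JSON type and must be encoded deliberately. The field names and the choice of ISO 8601 strings and decimal strings are assumptions made for illustration, not something Elasticsearch mandates.

```python
import json
from datetime import date
from decimal import Decimal


def to_json_safe(value):
    """Fallback encoder for values the json module cannot serialize natively."""
    if isinstance(value, date):
        return value.isoformat()   # dates become ISO 8601 strings
    if isinstance(value, Decimal):
        return str(value)          # keep exact digits instead of a lossy float
    raise TypeError(f"no JSON mapping chosen for {type(value).__name__}")


# Hypothetical source record, as it might come from a relational table.
order = {"order_id": 1, "placed_on": date(2014, 6, 5), "amount": Decimal("19.99")}

encoded = json.dumps(order, default=to_json_safe, sort_keys=True)
decoded = json.loads(encoded)
print(encoded)
```

After the round trip both values come back as plain strings, so the knowledge that one is a date and the other a decimal has to live in the index mapping or in application code.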
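The silent-overwrite behavior from point 7 can be illustrated with a toy in-memory model. This is not the Elasticsearch client; `ToyIndex` and its `create_only` flag are hypothetical names invented for this sketch, and the model only mimics the behavior described above.

```python
class ToyIndex:
    """In-memory stand-in for one index/type; NOT the Elasticsearch client."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, create_only=False):
        """Store a document and return its new version.

        create_only=True mimics asking for "create" semantics: the call
        fails instead of overwriting an existing id.
        """
        current = self._docs.get(doc_id)
        if current is not None and create_only:
            raise KeyError(f"document {doc_id!r} already exists")
        version = 1 if current is None else current[0] + 1
        self._docs[doc_id] = (version, dict(source))
        return version

    def get(self, doc_id):
        """Return the (version, source) pair stored under doc_id."""
        return self._docs[doc_id]


idx = ToyIndex()
first = idx.index("1", {"title": "draft"})
second = idx.index("1", {"title": "final"})   # same id: no error, version bumps
print(first, second, idx.get("1"))
```

In Elasticsearch itself, the closest equivalent of `create_only` is, as far as I know, the `op_type=create` parameter on the index request, which makes the call fail when the Id already exists.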
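For point 9, the loading half of a relational import can be sketched in a few lines of Python. The example below reads rows from a database and builds the newline-delimited body that the Elasticsearch bulk API accepts. An in-memory SQLite table stands in for the real source database, and the table, index and type names (`customers`, `crm`, `customer`) are made up for this sketch.

```python
import json
import sqlite3


def rows_to_bulk_body(rows, index, doc_type, id_column):
    """Turn dict-like rows into a newline-delimited bulk request body."""
    lines = []
    for row in rows:
        doc = dict(row)
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc[id_column])}}
        lines.append(json.dumps(action))   # action line
        lines.append(json.dumps(doc))      # document source line
    # Every line, including the last, is terminated by a newline.
    return "".join(line + "\n" for line in lines)


# In-memory SQLite stands in for SQL Server, Oracle, MySQL and so on.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
body = rows_to_bulk_body(rows, index="crm", doc_type="customer", id_column="id")
print(body)
```

The resulting string would be sent with an HTTP POST to the `_bulk` endpoint. Scheduling, change detection and retries are deliberately left out; those are the parts that a dedicated pipeline is meant to handle.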
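The reason behind the fixed shard count in point 10 is easier to see with a small simulation. Elasticsearch picks a shard by hashing a routing value (the document Id by default) and taking it modulo the number of primary shards. The sketch below uses `zlib.crc32` purely as a stand-in hash, which is an assumption of this example and not the function Elasticsearch uses, and counts how many documents would map to a different shard if the count changed from 5 to 6.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Pick a shard from a stable hash of the id, modulo the shard count."""
    # zlib.crc32 is a stand-in hash chosen for this sketch only.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


doc_ids = [f"doc-{n}" for n in range(1000)]
with_5 = {d: shard_for(d, 5) for d in doc_ids}
with_6 = {d: shard_for(d, 6) for d in doc_ids}

moved = sum(1 for d in doc_ids if with_5[d] != with_6[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because existing documents would suddenly be looked up on the wrong shard, the shard count has to stay fixed for the life of the index, and it is worth sizing it for expected growth at creation time.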
These are an initial set of considerations to keep in view while fitting Elasticsearch into your solution. In a future post, I will share another part of this architecture design article, with more points to consider.