Monday, June 16, 2014

SQL Server vs MongoDB vs MySQL


Microsoft SQL Server is one of the mainstream databases used in most operational systems built on the Microsoft technology stack. One of its biggest shortcomings is the lack of built-in support for horizontal scaling / sharding. So the next logical choice closest to SQL Server would be MySQL.

If you are looking for horizontal scaling / sharding, you are probably gearing up to deal with Big Data. MongoDB is arguably the first logical step into the NoSQL world for anyone considering experimenting with NoSQL to handle Big Data.

At this stage, one needs to compare all these databases. Below is a quick comparison, with limitations highlighted in red and product strengths in blue.


Reference: DB-Engines.com

Saturday, June 14, 2014

Elasticsearch vs Solr vs Endeca vs Sharepoint FAST vs Google Search Appliance ( GSA ) vs Autonomy vs Semaphore

Enterprise Search is a huge market. Fortunately, there are just a handful of products catering to this business; unfortunately, none of them is a one-product-fits-all solution.

There are specific categories of features expected from an enterprise search product, which make it suitable for one requirement or another. Some of them are listed below:

1) Crawling
  • Web Crawling: An enterprise has most of the content on portals in the form of html and media documents. A crawler is the basic means to create an index out of this content.
  • DB Crawling: Data stored in databases often needs to be crawled or imported into the search inventory.
2) Taxonomy

Taxonomy is the logical organization of content in the enterprise content management system. Some term it as metadata or structure or term stores of the index maintained in the system.  It's the method of framing structure around the content, so that information can be retrieved more effectively and precisely.

For example, a very simple way of implementing taxonomy can be the ability to tag content using a set of keywords defined centrally at the organization level.

3) Specialized OOB Search
  • Faceted search (like on Amazon, where a set of categories appears on the left side)
  • Dictionary based search (where you look for a word and its synonyms)
  • Auto-suggest (for example, when you type terms in Google and it suggests a few phrases)
4) Pluggability
  • Ability to index SMTP server
  • Ability to index LDAP server
  • Out-of-box ability to index any such external systems
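
As a rough illustration of the dictionary based search listed above, the sketch below expands a query term with its synonyms before matching documents. The synonym table, the sample documents and the function names are all made up for illustration; real products ship far richer dictionaries and matching logic.

```python
# Toy dictionary-based search: a query term also matches its synonyms.
# SYNONYMS and DOCS are hypothetical sample data, not from any product.
SYNONYMS = {
    "laptop": {"notebook", "ultrabook"},
    "phone": {"mobile", "handset"},
}

DOCS = {
    1: "lightweight notebook with long battery life",
    2: "rugged handset for field engineers",
    3: "gaming laptop with dedicated graphics",
}


def expand(term):
    """Return the term together with its configured synonyms."""
    term = term.lower()
    return {term} | SYNONYMS.get(term, set())


def search(term):
    """Return sorted ids of documents containing the term or a synonym."""
    wanted = expand(term)
    return sorted(
        doc_id for doc_id, text in DOCS.items()
        if wanted & set(text.lower().split())
    )


if __name__ == "__main__":
    print(search("laptop"))
    print(search("phone"))
```

Here a search for "laptop" also finds the document that only mentions "notebook", which is the behavior a central synonym dictionary is meant to provide.
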
Systems like Google Search Appliance, Oracle Endeca, HP Autonomy, Microsoft Sharepoint search, and Solr are the leaders in this category. Products like Smartlogic Semaphore add a value-added layer on top of them.

But the big question is: where do products like Elasticsearch fit in here?

While these products score well on the features mentioned above, they also have some downsides / limitations, and that is where Elasticsearch or even Solr steps in.

1) None of these products is economical. For example, HP Autonomy is reported to have a base price of more than half a million dollars. Not every enterprise has the budget for it.

2) Some products do not support database indexing easily. For example, GSA does not make it easy to use complex delta-detection queries for indexing data from databases.

3) Most of these products do not scale horizontally. Apart from appliance solutions, products like Endeca are resource intensive and, because of their scalability architecture, not suited to big data volumes.

4) Extending the product through custom development against its APIs is not as easy as with open source products.

Custom search for applications is inevitable. Though the enterprise search platform may be dominated by these products, for powering custom applications that manage big data with specialized search functionality (for example, ecommerce sites like amazon.com), products like Elasticsearch and Solr will continue to find their space.

The limitation of products like Elasticsearch is that they lack enterprise-scale features, for example OOB crawlers, the information visualization and reporting layers required for e-discovery and reporting, and anything beyond very limited taxonomy support, which is crucial for an enterprise search platform. But as the product is still very young and evolving, these features can hopefully be expected over the next couple of years.

Thursday, June 12, 2014

Elasticsearch with .NET : NEST Library Code Example

Elasticsearch can be used with a number of programming languages, one of them being Microsoft .NET. Two .NET clients are available: Elasticsearch.NET (low level client) and NEST (high level client).

NEST comes with a strongly typed wrapper around the Elasticsearch.NET API, and allows for a fully object oriented programming approach to interface with Elasticsearch. It also has nice documentation to learn the APIs.

The first program I generally want to write is one that indexes a structured document into Elasticsearch using C# code and the NEST APIs. One only needs any version of Visual Studio with the NEST NuGet package installed. Below is the very first console application I wrote to test the .NET integration with Elasticsearch. Let me know whether you liked the code, whether it worked for you, and whether you need any help with programming.


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

using Nest;
using Nest.Domain.Connection;

namespace ESConsole
{
    class Program
    {
        static void Main(string[] args)
        {
            var uri = new Uri("http://localhost:9200");
            var settings = new ConnectionSettings(uri).SetDefaultIndex("contacts");
            var client = new ElasticClient(settings);
            

            if (client.Health(HealthLevel.Cluster).ConnectionStatus.Success)
            {
                Console.WriteLine("Connection Successful");
                
                if (client.IndexExists("contacts").Exists)
                {
                    Console.WriteLine("Index Exists");
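                    // Note: only the "contacts" index was checked above; the article goes to a separate "blog" index.
                    Program.UpsertArticle(client, new Article("The Last Airbender", "Siddharth"), "blog", "article", 1);
                    // The contact is stored in index "contacts", type "contacts", with id 2.
                    Program.UpsertContact(client, new Contacts("Siddharth Mehta", "India"), "contacts", "contacts", 2);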
                    Console.WriteLine("Data Indexed Successfully");
                }
                else
                {
                    Console.WriteLine("Index Does Not Exist");
                }
                
            }
            else
            {
                Console.WriteLine("Connection Failed");
            }

            Console.ReadKey();

        }

        public class Article
        {
            public string title { get; set; }
            public string artist { get; set; }
            public Article(string Title, string Artist)
            {
                title = Title; artist = Artist;
            }
        }

        public class Contacts
        {
            public string name { get; set; }
            public string country { get; set; }
            public Contacts(string Name, string Country)
            {
                name = Name; country = Country;
            }
        }

        public static void UpsertArticle(ElasticClient client, Article article, string index, string type, int id)
        {            
            var RecordInserted = client.Index(article, index, type, id).Id;
                        
            if (RecordInserted.ToString() != "")
            {
                Console.WriteLine("Transaction Successful !");
            }
            else
            {
                Console.WriteLine("Transaction Failed");
            }
        }

        public static void UpsertContact(ElasticClient client, Contacts contact, string index, string type, int id)
        {
            var RecordInserted = client.Index(contact, index, type, id).Id;

            if (RecordInserted.ToString() != "")
            {
                Console.WriteLine("Transaction Successful !");
            }
            else
            {
                Console.WriteLine("Transaction Failed");
            }
        }
    }
}
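
For comparison, the contact indexed through NEST above corresponds to a plain HTTP request against Elasticsearch's REST API. The sketch below, written in Python purely for brevity, only builds that request without sending it. The `build_index_request` helper is a hypothetical name, and the `/{index}/{type}/{id}` URL layout is the one used by Elasticsearch releases of this period.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one document."""
    url = "{}/{}/{}/{}".format(base_url.rstrip("/"), index, doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    request = build_index_request(
        "http://localhost:9200", "contacts", "contacts", 2,
        {"name": "Siddharth Mehta", "country": "India"},
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send it to a running node:
    # urllib.request.urlopen(request)
```

Seeing the raw request makes it easier to debug what a wrapper such as NEST is doing on your behalf.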

Monday, June 09, 2014

Elasticsearch with SQL Server

Elasticsearch is a very powerful value addition to any relational DBMS like SQL Server, Oracle, DB2 etc., provided it's used wisely. Before looking at how to use Elasticsearch with SQL Server, we should ask why to use Elasticsearch with SQL Server. This question holds the key to the answer.

SQL Server holds data either in relational form or in multi-dimensional form (through SSAS). Full Text Search (FTS) in SQL Server provides some out-of-box search features, but when search queries require exhaustive searching over huge datasets and the search definition itself is complex, the performance impact becomes evident. Elasticsearch is primarily a search engine, but loaded with features like facets and the aggregation framework, it helps solve many data analysis problems. For example, all of us have visited sites like Amazon.com, Ebay.com, Flipkart.com etc. Whenever we search for a product, the site builds all the dynamic categories, ranges and values on the fly. For such features, a product like Elasticsearch can be extremely helpful. One such real project example can be read from here.
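
To give a feel for what "building categories and ranges on the fly" means, here is a minimal sketch that computes a terms facet and a range facet over a handful of search hits in plain Python. It is not the Elasticsearch aggregation API; the hits, field names and price ranges are hypothetical sample data.

```python
from collections import Counter

# Hypothetical product hits for a search such as "headphones".
HITS = [
    {"name": "A", "brand": "Sony", "price": 45},
    {"name": "B", "brand": "Bose", "price": 299},
    {"name": "C", "brand": "Sony", "price": 120},
    {"name": "D", "brand": "JBL", "price": 80},
]

PRICE_RANGES = [(0, 50), (50, 100), (100, 500)]  # half-open [low, high) buckets


def terms_facet(hits, field):
    """Count how many hits carry each distinct value of `field`."""
    return dict(Counter(hit[field] for hit in hits))


def range_facet(hits, field, ranges):
    """Count how many hits fall in each half-open [low, high) range."""
    counts = {bucket: 0 for bucket in ranges}
    for hit in hits:
        for low, high in ranges:
            if low <= hit[field] < high:
                counts[(low, high)] += 1
                break
    return counts


if __name__ == "__main__":
    print(terms_facet(HITS, "brand"))
    print(range_facet(HITS, "price", PRICE_RANGES))
```

A search engine computes such counts alongside the query results, which is what drives the category and price filters on shopping sites.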



How to use Elasticsearch with SQL Server ?


Elasticsearch JDBC River is the best means (to the best of my knowledge, as of this writing) to load data from SQL Server into an Elasticsearch index. One of the best explanations of setting up the Elasticsearch JDBC river with SQL Server can be read from here.

One point to keep in view is that if you set up a river and then restart the Elasticsearch server, the river executes its configured query again. This could reload the entire data set into the index. If the IDs are fetched from the source, all existing records simply get updated. But if the IDs are auto-generated by Elasticsearch, the reload creates new records, which ultimately leads to duplicate data. So use the river cautiously. You can also delete the river once the data is loaded into the index, if it's a one-time data migration.
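
The difference between the two ID strategies can be shown with a small in-memory simulation. `ToyIndex` and `run_river` are hypothetical stand-ins written for this illustration; they only mimic the "same id overwrites and bumps the version, no id creates a new record" behavior described above and are not the Elasticsearch or river API.

```python
import itertools


class ToyIndex:
    """In-memory stand-in for an index; not the Elasticsearch API."""

    def __init__(self):
        self.docs = {}                      # id -> (version, document)
        self._auto_ids = itertools.count(1)

    def index(self, document, doc_id=None):
        """Store a document; reuse doc_id if given, else generate one."""
        if doc_id is None:
            doc_id = "auto-{}".format(next(self._auto_ids))
        version = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (version, document)
        return doc_id, version


# Hypothetical source rows: (primary key, document).
ROWS = [(101, {"name": "Alice"}), (102, {"name": "Bob"})]


def run_river(index, use_source_ids):
    """Load every source row, as a river would on each (re)start."""
    for row_id, doc in ROWS:
        index.index(doc, doc_id=row_id if use_source_ids else None)


if __name__ == "__main__":
    with_ids, auto_ids = ToyIndex(), ToyIndex()
    for _ in range(2):                      # initial load plus one restart
        run_river(with_ids, use_source_ids=True)
        run_river(auto_ids, use_source_ids=False)
    print(len(with_ids.docs), "documents when source IDs are reused")
    print(len(auto_ids.docs), "documents when IDs are auto-generated")
```

Carrying the primary key from the source table into the document ID keeps a reload idempotent, at the cost of rewriting every document.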

Thursday, June 05, 2014

Elasticsearch Architecture Design - Considerations for using Elasticsearch in custom solutions

Elasticsearch has its own use cases for which one may want to consider it as part of the technology stack of a solution. While considering Elasticsearch in the scope of the solution, there are certain aspects of the product which should be kept in view from a solution design perspective.

Below is a list of some of the most important architecture design considerations from my perspective.

1) Elasticsearch is a document oriented database system. Document oriented stores are useful where the need is to have a scalable system and queries are driven by content of the data and not by the key which is the case in key-value stores.

2) Elasticsearch stores data in the form of JSON documents. JSON has a limited set of datatypes. So if the need is to have a very strongly and uniquely typed system, one should plan the data migration or storage strategy carefully.

3) Elasticsearch is a database system, that exposes REST based APIs for data as well as server administration. This means that elasticsearch can be seen as a DB server as well as a Web server. So when the infrastructure landscape is being designed, one should carefully consider whether to place elasticsearch in web zone or in db zone.

4) Elasticsearch does not ship with any authentication module as of date. There are some community plugins available though. Keeping this point in view, one would want to keep elasticsearch behind a web application and not expose it directly over internet.

5) Elasticsearch does not ship with any GUI tools or editors for development purposes. For the same, elasticsearch has a concept of plugins, which can be developed using Elasticsearch APIs. A huge number of web based plugins for elasticsearch are available on GitHub, which can be easily installed and used for development and data analysis purposes.

6) Elasticsearch client wrappers are quite popular among developers as they provide familiar syntax and eliminate the need to learn the Elasticsearch API. For example, NEST is a .NET wrapper on top of the Elasticsearch API. The risk with such wrapper frameworks is that they must continuously update their API to stay compliant with the Elasticsearch API. Elasticsearch has a very aggressive release schedule, with typically at least 3-4 releases of the product every year, some of which also have breaking changes.

7) Elasticsearch, like many other NoSQL products, applies default behavior and automated data management until explicitly overridden. For example, if a document is inserted with Id 1, and then the same document is inserted again into the same index and type, it won't raise an exception like relational databases do. It silently updates the document and increments its version number. As another example, Elasticsearch may place any document on any shard, which may reside on any node. So unless placement is explicitly controlled, any document can end up on any machine, node and shard.

8) Elasticsearch versions every document, in the style of a Multi Version Concurrency Control (MVCC) system. This means that a document is never actually updated in place. Whenever an update request is received, Elasticsearch writes a new copy of the document with an incremented version number and marks the old copy as deleted. The old copies are purged only when the underlying Lucene segments are merged, so huge and frequent updates on the system can result in high storage requirements.

9) Elasticsearch has a concept of "rivers" for integrating with external systems like CouchDB, Twitter etc. But out-of-box, it does not ship with any river to import data into Elasticsearch from SQL Server, Oracle, MySQL, DB2 and other relational databases. There are community plugins like JDBC River which can help with one-time imports of data. But for continuous extraction and loading of data into Elasticsearch, a distributed pipeline built on systems like Apache Kafka or Twitter Storm would be advisable.

10) Elasticsearch has some very rigid settings related to data management, which should be taken care of even before creating the system. For example, when an index is created, by default it is configured with 5 primary shards. Once the index is created, the number of primary shards cannot be changed for the life of the index. So if you planned to manage 50 million records with 5 shards, and the requirements change so that you now need to load 150 million, you would still have to manage 150 million records with only 5 shards.
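
The shard count is rigid because of how documents are routed: the target shard is derived from a hash of the routing value (the document ID by default) modulo the number of primary shards. The sketch below illustrates the idea with CRC32 as a stand-in hash. That choice is an assumption made for illustration and is not the hash function Elasticsearch actually uses; the function names are hypothetical as well.

```python
import zlib


def shard_for(doc_id, number_of_primary_shards):
    """Pick a shard as hash(routing value) modulo the primary shard count."""
    digest = zlib.crc32(str(doc_id).encode("utf-8"))
    return digest % number_of_primary_shards


def moved_documents(doc_ids, old_shards, new_shards):
    """Return ids whose shard would differ if the shard count changed."""
    return [
        doc_id for doc_id in doc_ids
        if shard_for(doc_id, old_shards) != shard_for(doc_id, new_shards)
    ]


if __name__ == "__main__":
    ids = range(1, 1001)
    moved = moved_documents(ids, 5, 6)
    print("{} of 1000 documents would map to a different shard".format(len(moved)))
```

Because most documents would map to a different shard after such a change, lookups by ID would go to the wrong place, which is why growing an index means creating a new one with more shards and reindexing.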

These are some of the initial considerations to keep in view while fitting Elasticsearch into your solution. In time, I will share another part of this architecture design article, with more points to consider.
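
To make point 2 above concrete, the sketch below shows a row carrying a decimal and a date, two types JSON has no native representation for, and one possible way of mapping them to strings before indexing. The row, the field names and the string-based mapping are assumptions for illustration; other mappings (for example epoch milliseconds for dates) may suit a given system better.

```python
import json
from datetime import date
from decimal import Decimal

# Hypothetical row as it might come back from a relational database.
row = {
    "order_id": 1001,
    "amount": Decimal("19.99"),
    "ordered_on": date(2014, 6, 5),
}


def to_json_value(value):
    """Map types JSON lacks onto ones it has (here: strings)."""
    if isinstance(value, Decimal):
        return str(value)          # keep exact digits instead of a float
    if isinstance(value, date):
        return value.isoformat()   # ISO 8601 date string
    raise TypeError("Cannot serialize {!r}".format(value))


def row_to_document(record):
    """Serialize a row to a JSON document string."""
    return json.dumps(record, default=to_json_value, sort_keys=True)


if __name__ == "__main__":
    print(row_to_document(row))
```

Calling `json.dumps(row)` without the `default` hook raises a `TypeError`, which is the kind of mismatch a migration strategy has to settle up front.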

Sunday, June 01, 2014

Elasticsearch Tutorial - Elasticsearch Storage Architecture : Analysis and Inverted Indexes

Elasticsearch uses the Apache Lucene engine for almost all of its operations. One of the primary differences between relational databases and NoSQL systems is the way they store data. When it comes to the storage architecture of Elasticsearch, there are two terms which are key to the storage mechanism: the analysis process and inverted indexes.

What is Analysis process in elasticsearch ? 

In Part 1, I already explained what's a tokenizer and filter in elasticsearch. Whenever an index is created, a default mapping and analyzer would be attached to it. Depending on the config of the analyzer, a tokenizer and filter would be configured for the same.

When a document indexing request is received by Elasticsearch, which in turn hands it to Lucene, the document is converted into a stream of tokens. After the tokens are generated, they are filtered by the configured filter. This entire process is called the analysis process, and it is applied to every document that gets indexed.

Below is an example of the analysis process. Consider an HTML tag with an embedded sentence as the input document. When it passes through a set of filters and tokenizers, it gets converted into a set of tokens, which finally get indexed.
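
The same flow can be sketched in a few lines of Python. This is a simplified imitation of an analyzer (strip HTML, tokenize, lowercase, drop stop words), not Lucene's implementation; the function names and the stop-word list are assumptions made for illustration.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of"}  # hypothetical stop-word list


def strip_html(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: keep runs of letters and digits as tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def analyze(text):
    """Run char filter -> tokenizer -> token filters and return the tokens."""
    tokens = tokenize(strip_html(text))
    tokens = [token.lower() for token in tokens]                    # lowercase filter
    return [token for token in tokens if token not in STOP_WORDS]  # stop-word filter


if __name__ == "__main__":
    print(analyze("<p>The Quick brown Fox</p>"))
```

For the input `<p>The Quick brown Fox</p>` this yields the tokens `quick`, `brown` and `fox`; swapping in a different tokenizer or filter changes what ends up in the index.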



What is the storage data structure of elasticsearch ? Inverted Indexes

In SQL Server, for example, an index uses a B-tree as its data structure. In Elasticsearch, after the analysis process has converted the data into tokens, these tokens are stored in an internal structure called an inverted index. This structure maps each unique term in the index to the documents that contain it, which allows for faster search and text analytics. Attributes such as term count and term position are stored along with each term. Below is a sample visualization of what an inverted index may look like.

After the tokens are mapped, the document is stored on disk. One can choose to store the original input of the document along with the analyzed document. The original input gets stored in a system field named "_source". One can even choose not to analyze the input and store the document without any analysis. The structure of the inverted index depends entirely on the analyzer chosen for indexing.
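
A toy inverted index can be built with a dictionary that maps each term to the documents containing it, along with the positions at which it occurs. This is only a sketch of the idea, far simpler than Lucene's on-disk format; the sample documents and function names are hypothetical, and whitespace splitting stands in for a real analyzer.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for the given documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, term):
    """Return sorted ids of documents containing the term."""
    return sorted(index.get(term.lower(), {}))


DOCS = {  # hypothetical sample documents
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick quick rabbit",
}

if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    print(search(index, "brown"))
    print(index["quick"])
```

A lookup is a single dictionary access on the term rather than a scan of every document, and the length of each position list gives the term count per document.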


Summary: One thing to learn from this is that the key to an efficient storage and retrieval process is the analysis process defined on the index, as per the application needs.