Challenges of healthcare data
Healthcare data makes up a third of the world’s data and is projected to grow over the next few years at a faster pace than in traditional data-rich industries such as financial services and manufacturing. The staggering volume of healthcare data, combined with its heterogeneity and fragmentation, poses substantial challenges to extracting insights that improve healthcare outcomes for individuals and communities.
At Carelon, a healthcare services company dedicated to making it easier to deliver whole-person health, we have undertaken various initiatives to address these challenges, most notably full-text search of healthcare claims. In this post, we share our automated deployment and configuration solution for the ELK stack. We use the ELK stack within our organizational unit at Carelon to power real-time search on top of a healthcare claims data lake, with multi-query executions completing in single-digit seconds against an index holding billions of records.
Existing healthcare data paradigm
Nowadays, consumer-facing applications offer search functionality in more than one place, and there is an expectation that it serves on-demand, accurate, real-time results. This is achieved via real-time (or near real-time) synchronization of the application data with a search index, backed by a NoSQL analytics database such as Elasticsearch.
Other types of applications are internal-facing enterprise apps, which serve diverse communities of business leaders, subject matter experts, and data-centric teams. Here, the need is to access on-demand and real-time insights and seamlessly explore multiple “what if” business scenarios.
Much of the operational data in enterprises today is transactional in nature. This data typically lives in relational databases and is consolidated into a centralized data lake.
Traditionally, enterprise analytics applications rely heavily on SQL for preparing the data, and quite often that SQL logic is the end-result of a long and iterative process. An alternative to the SQL-only approach is to implement the best of both worlds: a data lake with a highly customized “Search Index” layered on top. This search index embeds answers to the most-encountered use cases by the business and analytics teams.
Within our approach, the complex and ever-evolving business questions described by the SQL are run automatically, at scale, and their results are stored in the search index. Those results are then consumed many times over, by everyone, via pre-built user interfaces.
Furthermore, real-time data synchronization between the data lake and the search index is less critical than the ability to execute automated, fast, high-throughput data pipelines. These pipelines push data with ever-evolving mappings that reflect new business rules (to be described in a follow-up blog post), or even completely refresh the search index.
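As an illustration of such a full refresh, one common pattern is to build the new index behind an alias and swap the alias once indexing completes, so readers never query a half-built index. The Ansible tasks below are only a sketch of that pattern; the index names (claims-v1, claims-v2), the alias, and the connection variables (es_host, es_user, es_password) are hypothetical and not taken from our actual pipelines.

```yaml
# Illustrative sketch only: index names, alias, and connection variables are hypothetical.
- name: Create the next generation of the claims index with updated mappings
  ansible.builtin.uri:
    url: "https://{{ es_host }}:9200/claims-v2"
    method: PUT
    user: "{{ es_user }}"
    password: "{{ es_password }}"
    body_format: json
    body: "{{ lookup('file', 'claims-v2-mappings.json') }}"
    status_code: 200

- name: Point the read alias at the freshly built index and drop the old one
  ansible.builtin.uri:
    url: "https://{{ es_host }}:9200/_aliases"
    method: POST
    user: "{{ es_user }}"
    password: "{{ es_password }}"
    body_format: json
    body:
      actions:
        - add: { index: "claims-v2", alias: "claims" }
        - remove: { index: "claims-v1", alias: "claims" }
    status_code: 200
```

Because all actions in a single _aliases request are applied atomically, a complete refresh of this kind is safe to run while the index is being queried.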
This is a great example where a search and analytics solution goes hand in hand with the relational database management system (RDBMS) business transactions data store and data lake at the enterprise.
Real-time, scalable, and secure data at Carelon
At Carelon, we achieve real-time search on an index with billions of records and complex mappings, running on an Elasticsearch cluster that spans dozens of data nodes (with over 1K CPUs in total). In addition, working in an environment governed by stringent security and compliance policies motivated us to develop an in-house Ansible deployment for Elasticsearch on “bare-metal” compute nodes.
In the rest of this blog post, we share our custom Ansible deployment and configuration for all ELK stack services on a reference 3-node cluster, which can easily be modified and scaled to meet your specific data indexing and search needs (on as many compute nodes as your budget allows).
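To make the reference setup concrete, an Ansible inventory for such a 3-node cluster could look like the sketch below. The hostnames and group names here are placeholders; the authoritative inventory layout is described on the elk-ansible.github.io project page.

```yaml
# Hypothetical inventory sketch for a 3-node reference cluster; hostnames are
# placeholders, and the real group names should follow the elk-ansible documentation.
all:
  children:
    elasticsearch:
      hosts:
        elk-node-1.example.internal:
        elk-node-2.example.internal:
        elk-node-3.example.internal:
    kibana:
      hosts:
        elk-node-1.example.internal:
    logstash:
      hosts:
        elk-node-2.example.internal:
```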
ELK node deployment Ansible recipe
The end goal of this blog post is to demonstrate an end-to-end Ansible deployment and configuration of an ELK-B cluster, including APM Server and Enterprise Search, which we will refer to as the ELK stack for brevity.
The detailed steps are described at the elk-ansible.github.io project page. Here we will provide a high-level overview of these steps. In short, these steps are:
- Prepare the minimalistic infrastructure, which consists of four compute nodes, DNS records, and the various certificates: PKCS12 keystores for the servers and CA files for the clients.
- Obtain the Ansible roles and modify the input configuration for the playbooks, i.e., the Ansible inventory variables and host files in the inventory directory (see the variables sketch after this list).
- For each service, run the command: ansible-playbook -i <inventory_directory> playbook-<service-name>.yml
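As referenced in step 2, the inventory variables are the single place where cluster-wide settings are controlled. The snippet below is only an illustration of what such a variables file might contain; the variable names shown here are assumptions, and the authoritative names are defined by the elk-ansible roles and their documentation.

```yaml
# Sketch of inventory group variables; the variable names are illustrative,
# the authoritative names are defined by the elk-ansible roles.
elasticsearch_version: "7.17.0"
elasticsearch_cluster_name: "claims-search"
elasticsearch_ssl_keystore: "certs/elasticsearch.p12"   # PKCS12 keystore from step 1
elasticsearch_ssl_ca: "certs/ca.crt"                    # CA file distributed to the clients
kibana_server_host: "kibana.example.internal"
```

With the inventory filled in, each service is then deployed with its own playbook run, for example: ansible-playbook -i inventory/ playbook-elasticsearch.yml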
We need an existing compute infrastructure on which to deploy the ELK stack. For completeness, we provide a set of instructions for deploying the underlying compute infrastructure on AWS. Step 1 for deploying the compute nodes is provided for educational purposes only and does not correspond to any real-world production-grade scenario.
Architecture supported by the Ansible deployment
Currently, we support the latest release of version 7 of the ELK stack; support for version 8 will be added soon. All the services communicate over SSL, and the various secret settings are stored in each component’s secrets keystore, except for Enterprise Search, which does not ship with a native secrets manager.
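For example, a secret can be pushed into the Elasticsearch keystore from a playbook along the lines of the sketch below. The installation path assumes a default package install and the variable name is hypothetical; this is a simplified illustration rather than the exact task used in our roles.

```yaml
# Sketch: store a secret in the Elasticsearch secrets keystore rather than in a plain config file.
# The binary path assumes a default package install; the variable name is illustrative.
- name: Add the bootstrap password to the Elasticsearch keystore
  ansible.builtin.command:
    cmd: /usr/share/elasticsearch/bin/elasticsearch-keystore add --force --stdin bootstrap.password
    stdin: "{{ elastic_bootstrap_password }}"
  become: true
  no_log: true
```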
In Figure 1, we outline the ELK stack configuration supported by the Ansible playbooks and roles. All the settings are controlled via the Ansible hosts and variables files. The setup accounts for a dedicated monitoring cluster that collects observability data via Beats; in our physical setup, we use the same Elasticsearch cluster for both the Data and Monitoring roles.
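To show how the Beats leg of that monitoring path might look, the fragment below is a minimal Metricbeat sketch with placeholder hostnames and credentials; in a setup like ours, where one cluster plays both roles, the output would simply point back at the data cluster.

```yaml
# Sketch of a metricbeat.yml fragment shipping stack monitoring data to a
# dedicated monitoring cluster; hostnames and credentials are placeholders.
metricbeat.modules:
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts: ["https://elk-node-1.example.internal:9200"]

output.elasticsearch:
  hosts: ["https://monitoring.example.internal:9200"]
  username: "${MONITORING_USER}"
  password: "${MONITORING_PASSWORD}"
  ssl.certificate_authorities: ["certs/ca.crt"]
```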