
Posts

Showing posts from November, 2017

Kafka 101: Deploying Kafka to Google Compute Engine

This article provides a startup script for deploying Kafka to a Google Compute Engine instance. This isn't meant to be a production-ready system: it uses the Zookeeper instance embedded with Kafka and keeps most of the default settings. Instead, treat this as a quick and easy way to do Kafka development using a live server.

This article uses Compute Engine startup scripts to install and run Kafka on instance startup. Startup scripts allow you to run arbitrary Bash commands whenever an instance is created or restarted. Since this script is run on every restart, we lead with a check that makes sure we have not already run the startup script and, if we have, we simply exit.

#!/usr/bin/env bash

STARTUP_VERSION=1
STARTUP_MARK=/var/startup.script.$STARTUP_VERSION

if [[ -f $STARTUP_MARK ]]; then
  exit 0
fi

Then we configure our Kafka and Scala version numbers used in the rest of the script.

SCALA_VERSION=2.10
KAFKA_VERSION=0.9.0.0-SNAPSHOT
KAFKA_HOME=/opt/kafka_…
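As a rough sketch of how such a script gets attached to an instance, the command below creates a Compute Engine VM with a startup script supplied from a local file. The instance name kafka-dev and the file name kafka-startup.sh are placeholders for illustration, not values from the article.

> gcloud compute instances create kafka-dev --metadata-from-file startup-script=kafka-startup.sh

On every boot, Compute Engine runs the supplied script as root, which is why the version-marker check above is needed to keep the installation from repeating.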

Kafka 101: Kafka Quick Start Guide

If you've read the previous article describing Kafka in a Nutshell, you may be itching to write an application using Kafka as a data backend. This article will get you part of the way there by describing how to deploy Kafka locally using Docker and test it using kafkacat.

Running Kafka Locally

First, if you haven't already, download and install Docker. Once you have Docker installed, create a default virtual machine that will host your local Docker containers.

> docker-machine create --driver virtualbox default

Clone the Kafka docker repository. You could use the Docker pull command here, but I find it instructive to be able to view the source files for your container.

> git clone https://github.com/wurstmeister/kafka-docker

Set a default topic. Open docker-compose-single-broker.yml and set a default topic and advertised host name. You will want to use the IP address of your default Docker machine. Copy it to the clipboard with the following command.

> docker…
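Once the broker container is up, kafkacat can round-trip a test message. A minimal sketch, assuming the broker is reachable at the default docker-machine address 192.168.99.100 on port 9092 and that the default topic is named test; substitute your own machine IP and topic.

> echo "hello kafka" | kafkacat -P -b 192.168.99.100:9092 -t test
> kafkacat -C -b 192.168.99.100:9092 -t test -e

The first command produces a single message to the topic; the second consumes everything on the topic and exits when it reaches the end of the stream.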

Kafka 101: Kafka in a Nutshell

Kafka is a messaging system. That's it. So why all the hype? In reality, messaging is a hugely important piece of infrastructure for moving data between systems. To see why, let's look at a data pipeline without a messaging system.

This system starts with Hadoop for storage and data processing. Hadoop isn't very useful without data, so the first stage in using Hadoop is getting data in.

[Figure: Bringing Data in to Hadoop]

So far, not a big deal. Unfortunately, in the real world data exists on many systems in parallel, all of which need to interact with Hadoop and with each other. The situation quickly becomes more complex, ending with a system where multiple data systems are talking to one another over many channels. Each of these channels requires its own custom protocols and communication methods, and moving data between these systems becomes a full-time job for a team of developers.

[Figure: Moving Data Between Systems]

Let's look at this picture again, using Kafka as a central messa…

SQS or Kinesis? Comparing Apples to Oranges

When designing a durable messaging system I took a hard look at using Amazon's Kinesis as the message storage and delivery mechanism. At first glance, Kinesis has a feature set that looks like it can solve any problem: it can store terabytes of data, it can replay old messages, and it can support multiple message consumers. But if you dig a little deeper, you will find that Kinesis is well suited for a very particular use case, and if your application doesn't fit this use case, Kinesis may be a lot more trouble than it's worth.

In this article, I compare Kinesis with Amazon's Simple Queue Service (SQS), showing the benefits and drawbacks of each system, and highlighting the difference between data streams and queueing. This article should make clear why we built our durable messaging system using SQS, and why your application might benefit from SQS too.

Data Streams - The Kinesis Sweet Spot

Kinesis' primary use case is collecting, storing, and processing real-time continuous data…
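As a rough illustration of the queueing-versus-streaming distinction, the AWS CLI calls below show the basic shape of each API. The queue URL, stream name, partition key, and iterator values are placeholders, not values from the article.

# SQS: each message is delivered to one consumer and deleted once it has been processed.
> aws sqs send-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue --message-body "hello"
> aws sqs receive-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
> aws sqs delete-message --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue --receipt-handle <handle>

# Kinesis: records are appended to a shared stream and can be read, and replayed, by many consumers.
> aws kinesis put-record --stream-name my-stream --partition-key user-1 --data "hello"
> aws kinesis get-shard-iterator --stream-name my-stream --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON
> aws kinesis get-records --shard-iterator <iterator>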