Sharing some random tidbits I learned throughout the week.

This week I started doing some work with the Data team and decided to look up some of the terms I kept seeing float around.

  • A data warehouse is a database for data from various sources used to support reporting and analytics.
  • Hadoop is a Java-based framework that supports processing of large data sets in a distributed computing environment.
    • One processing example is MapReduce.
      • MapReduce is a programming model, and an implementation of it, for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
      • These large data sets can be stored in a file system like HDFS.
  • HDFS stands for Hadoop Distributed File System and is used to store large volumes of data.
    • A NameNode manages the file system metadata, while DataNodes store the actual data.
    • Files are split into large blocks, and each block is replicated on multiple DataNodes.
    • Because blocks are spread across many nodes, large reads can run in parallel, which gives high read throughput.
    • If a DataNode goes down, the NameNode arranges for the blocks it held to be re-replicated onto other DataNodes.
  • Apache Hive is a data warehouse infrastructure built on top of Hadoop for analyzing data stored in HDFS using a SQL-like language called HiveQL. Queries are turned into jobs (e.g. MapReduce) that run on the cluster.
    • SQL programmers can use this as an alternative to writing MapReduce jobs in Java.
  • YARN stands for Yet Another Resource Negotiator.
    • Hadoop was originally written solely as a MapReduce engine. You couldn’t really run anything else.
    • Starting with Hadoop 0.23, we can run other types of jobs and use YARN to manage their resources.
  • HBase is a distributed, column-family-oriented datastore built on top of HDFS.
    • Random-access, key-value store for structured data.
  • ETL, or Extract, Transform, Load, is a process used in data warehousing.
    • Extract data from data sources
    • Transform data into a nice format for querying
    • Load data into final target
  • Apache Pig is a platform for creating MapReduce programs that run on Hadoop.
    • Language for the platform is called Pig Latin.
  • Impala allows one to query HDFS using SQL without data movement or transformation.
  • Elasticsearch is a search engine server that allows you to index your documents and perform full-text searches.
    • HTTP web interface and REST API.
    • Schemaless JSON documents.
  • ZooKeeper is a coordination service for maintaining configuration information, synchronization, and group services for distributed applications.
  • Apache Spark is a cluster computing framework that can run much faster than Hadoop MapReduce for many workloads.
    • Relies more on RAM than on network and disk I/O.
      • Spark keeps working data in memory where it can, while Hadoop MapReduce writes intermediate results to disk.
  • Flume is a distributed service for collecting, aggregating and moving large amounts of log data into Hadoop.
  • Kafka is a publish-subscribe messaging system.
    • Take the logs from your various infrastructure components and send them to a central commit log.
    • Producers like your application write to the commit log.
    • Consumers like your monitoring system and Hadoop clusters fetch from the commit log.
  • Sqoop is a tool to transfer data between Hadoop and relational database servers.
  • Tableau provides interactive data visualization tools.
  • You can swap panes in tmux using prefix-{ or prefix-}
  • %w in Ruby builds an array of strings from whitespace-separated words:
irb(main):001:0> %w(This is neat)
=> ["This", "is", "neat"]
  • A Candlestick Chart displays the high, low, opening and closing prices for a security for a single day.
    • The wide part is called the real body. It tells investors whether the closing price was higher or lower than the opening price.
    • The shadows show the day’s high and low and how they compare to the open and close.
    • Long candles mean that there were large price movements, while shorter ones mean small price movements.
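
To make the candlestick terms above concrete, here is a minimal Ruby sketch. The `Candle` struct and its method names are made up for illustration (they do not come from any charting library), and the prices are invented sample values.

```ruby
# Hypothetical value object for one trading day; names are assumptions.
Candle = Struct.new(:open, :high, :low, :close) do
  # Size of the real body: the distance between the open and the close.
  def body
    (close - open).abs
  end

  # :up if the close is above the open, :down if below, :flat if equal.
  def direction
    return :flat if close == open
    close > open ? :up : :down
  end

  # Upper shadow: from the top of the real body to the day's high.
  def upper_shadow
    high - [open, close].max
  end

  # Lower shadow: from the day's low to the bottom of the real body.
  def lower_shadow
    [open, close].min - low
  end
end

# Invented sample day: open 100, high 112, low 98, close 108.
candle = Candle.new(100, 112, 98, 108)
puts candle.direction    # up
puts candle.body         # 8
puts candle.upper_shadow # 4
puts candle.lower_shadow # 2
```

A long candle in this sketch is simply one with a large `body` relative to other days, which is one rough way to compare price movement between days.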