Sharing some random tidbits I learned throughout the week.
This week I started doing some work with the Data team and decided to look up some of the terms I kept seeing float around.
- A data warehouse is a database for data from various sources used to support reporting and analytics.
- Hadoop is a Java-based framework that supports processing of large data sets in a distributed computing environment.
- One processing example is MapReduce.
- MapReduce is an implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
- These large data sets can be stored in a file system like HDFS.
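To make the map, shuffle and reduce steps concrete, here is a minimal single-process sketch of the classic word count. It only imitates the phases in plain Ruby; the method names are made up for illustration and none of this is Hadoop's actual API.

```ruby
# Toy, single-process imitation of the MapReduce phases (not Hadoop's API).

# Map: turn one line of text into [word, 1] pairs.
def map_phase(line)
  line.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Shuffle: group the emitted values by key.
def shuffle_phase(pairs)
  pairs.group_by(&:first).transform_values { |group| group.map(&:last) }
end

# Reduce: collapse each key's list of values into a single count.
def reduce_phase(grouped)
  grouped.transform_values(&:sum)
end

def word_count(lines)
  pairs = lines.flat_map { |line| map_phase(line) }
  reduce_phase(shuffle_phase(pairs))
end

p word_count(["This is neat", "this is also neat"])
```

On a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; here everything happens in one process.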
- HDFS stands for Hadoop Distributed File System and is used to store large volumes of data.
- A NameNode manages cluster metadata and DataNodes that store the data.
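- Data is split into large blocks, and each block is replicated on multiple DataNodes.
- Because the blocks are spread across nodes, large reads can happen in parallel, which gives high read throughput.
- If a DataNode goes down, the NameNode arranges for each block it held to be re-replicated onto another DataNode.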
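The NameNode bookkeeping described above can be sketched as a toy model. `ToyNameNode` and its methods are hypothetical names and this is not how HDFS is implemented. It only tracks which DataNodes hold which block and picks a new holder when one fails. The replication factor of 3 matches what I understand to be the HDFS default.

```ruby
# Toy model of the NameNode's block map (hypothetical names, not the HDFS API).
class ToyNameNode
  attr_reader :block_map

  def initialize(data_nodes, replication: 3)
    @live_nodes = data_nodes.dup
    @replication = replication
    @block_map = {} # block id => names of the DataNodes holding a replica
  end

  # Place a new block on `replication` distinct DataNodes, rotating the
  # starting node so blocks are spread around the cluster.
  def add_block(block_id)
    @block_map[block_id] = @live_nodes.rotate(@block_map.size).first(@replication)
  end

  # When a DataNode dies, give each block it held a new replica
  # on a live node that does not already have one.
  def node_failed(name)
    @live_nodes.delete(name)
    @block_map.each_value do |holders|
      next unless holders.delete(name)
      candidate = (@live_nodes - holders).first
      holders << candidate if candidate
    end
  end
end

name_node = ToyNameNode.new(%w[dn1 dn2 dn3 dn4])
name_node.add_block("blk_1")
name_node.node_failed("dn1")
p name_node.block_map
```

A real NameNode only coordinates this. The DataNodes copy the block data between themselves.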
- Apache Hive is a data warehouse infrastructure built on top of Hadoop for analyzing data stored in HDFS using a SQL-like language called HiveQL. Queries are turned into jobs (e.g. MapReduce) behind the scenes.
- SQL programmers can use this as an alternative to writing MapReduce jobs in Java.
- YARN stands for Yet Another Resource Negotiator.
- Hadoop was originally written solely as a MapReduce engine. You couldn’t really run anything else.
- Starting with Hadoop 0.23, we can run other types of jobs and use YARN to manage them.
- HBase is a distributed column-family oriented datastore built on top of HDFS.
- Random-access, key-value store for structured data.
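One way to picture a column-family store is as nested hashes: row key, then column family, then column qualifier. The sketch below is only that mental model in Ruby. `ToyColumnFamilyStore` is a hypothetical class, the real HBase client API looks different, and things like cell versions and timestamps are left out.

```ruby
# Toy column-family store: row key => family => qualifier => value.
# Hypothetical class for illustration; not the HBase client API.
class ToyColumnFamilyStore
  def initialize(*families)
    @families = families
    @rows = Hash.new { |rows, key| rows[key] = {} }
  end

  # Column families are fixed up front; qualifiers inside them are free-form.
  def put(row_key, family, qualifier, value)
    raise ArgumentError, "unknown column family #{family}" unless @families.include?(family)
    (@rows[row_key][family] ||= {})[qualifier] = value
  end

  # Random access by row key, optionally narrowed to a family and a qualifier.
  def get(row_key, family = nil, qualifier = nil)
    row = @rows.fetch(row_key, {})
    return row if family.nil?
    cells = row.fetch(family, {})
    qualifier.nil? ? cells : cells[qualifier]
  end
end

store = ToyColumnFamilyStore.new(:info, :stats)
store.put("user1", :info, "name", "Ada")
store.put("user1", :stats, "logins", 3)
p store.get("user1", :info, "name")
```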
- ETL (Extract, Transform, Load) is a process used in data warehousing.
- Extract data from data sources
- Transform data into a nice format for querying
- Load data into final target
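The three steps above can be sketched as three small methods. The sample data, field names and the in-memory "warehouse" array are all made up for illustration. A real pipeline would read from actual sources and write to an actual warehouse.

```ruby
# Hypothetical raw export; in practice this would come from a file, an API or a database.
def sample_orders_csv
  <<~CSV
    id,customer,amount_cents
    1, alice ,1250
    2,BOB,399
  CSV
end

# Extract: read rows out of the source format into hashes keyed by column name.
def extract(csv_text)
  header, *rows = csv_text.lines.map { |line| line.chomp.split(",") }
  rows.map { |row| header.zip(row).to_h }
end

# Transform: clean up values and convert types so they are easy to query.
def transform(records)
  records.map do |r|
    {
      id: Integer(r["id"]),
      customer: r["customer"].strip.downcase,
      amount: r["amount_cents"].to_i / 100.0
    }
  end
end

# Load: append the cleaned rows to the target
# (an in-memory array standing in for a warehouse table).
def load_into(records, target)
  target.concat(records)
end

warehouse = []
load_into(transform(extract(sample_orders_csv)), warehouse)
p warehouse
```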
- Apache Pig is a platform for creating MapReduce programs that run on Hadoop.
- Language for the platform is called Pig Latin.
- Impala allows one to query HDFS using SQL without data movement or transformation.
- Elasticsearch is a search engine server that allows you to index your documents and perform full-text searches.
- HTTP web interface and REST API.
- Schemaless JSON documents.
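To get a feel for what "index your documents and search them" means, here is a toy inverted index in Ruby. It is a stand-in for the general technique, not Elasticsearch itself: `ToyIndex` is a hypothetical class, there is no HTTP layer, no scoring, and re-indexing the same id is not handled.

```ruby
# Toy inverted index (hypothetical class, not the Elasticsearch API).
class ToyIndex
  def initialize
    @docs = {}
    @postings = Hash.new { |h, term| h[term] = [] } # term => ids of documents containing it
  end

  # Index a schemaless document: any hash of fields is accepted.
  def index(id, doc)
    @docs[id] = doc
    tokens(doc.values.join(" ")).uniq.each { |term| @postings[term] << id }
  end

  # Return the documents that contain every term of the query.
  def search(query)
    ids = tokens(query).map { |term| @postings.fetch(term, []) }.reduce(:&) || []
    ids.map { |id| @docs[id] }
  end

  private

  # Lowercase the text and split it into alphanumeric terms.
  def tokens(text)
    text.downcase.scan(/[a-z0-9]+/)
  end
end

idx = ToyIndex.new
idx.index(1, { "title" => "Hadoop basics", "tags" => "hdfs mapreduce" })
idx.index(2, { "title" => "Spark basics", "author" => "someone" })
p idx.search("spark basics")
```

Note that the two documents have different fields, which is the schemaless part.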
- ZooKeeper is a coordination service for maintaining configuration information, synchronization and group services for distributed applications.
- Apache Spark is a cluster computing framework that can run much faster than Hadoop MapReduce, especially when the working data fits in memory.
- Relies on RAM rather than on network and disk I/O.
- Keeps intermediate data in memory, while Hadoop MapReduce writes it to disk between steps.
- Flume is a distributed service for collecting, aggregating and moving large amounts of log data into Hadoop.
- Kafka is a publish-subscribe messaging system.
- Take the logs from your various infrastructure components and send them to a central commit log.
- Producers like your application write to the commit log.
- Consumers like your monitoring system and Hadoop clusters fetch from the commit log.
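The producer/consumer idea above can be sketched with an append-only array and one read offset per consumer. `ToyCommitLog` is a hypothetical class, not a Kafka client, and it skips everything that makes Kafka distributed (topics, partitions, brokers).

```ruby
# Toy append-only commit log (hypothetical class, not a Kafka client).
class ToyCommitLog
  def initialize
    @entries = []
    @offsets = Hash.new(0) # consumer name => next offset to read
  end

  # Producers append to the end of the log and get back the entry's offset.
  def publish(message)
    @entries << message
    @entries.size - 1
  end

  # Each consumer reads from its own offset, so consumers do not affect each other.
  def fetch(consumer, max = 10)
    start = @offsets[consumer]
    batch = @entries[start, max] || []
    @offsets[consumer] = start + batch.size
    batch
  end
end

log = ToyCommitLog.new
log.publish("app: started")
log.publish("app: request served")
p log.fetch("monitoring")
p log.fetch("hadoop", 1)
```

Because nothing is removed when it is read, a slow consumer can catch up later from wherever it left off.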
- Sqoop is a tool to transfer data between Hadoop and relational database servers.
- Tableau provides interactive data visualization tools.
- You can swap panes in tmux using prefix-{ or prefix-}
- %w in Ruby:
irb(main):001:0> %w(This is neat)
=> ["This", "is", "neat"]
- A Candlestick Chart displays the high, low, opening and closing prices for a security for a single day.
- The wide part is called the real body. It tells investors whether the closing price was higher or lower than the opening price.
- The shadows show the day’s high and low and how they compare to the open and close.
- Long candles mean that there were large price movements, while shorter ones mean small price movements.
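The parts of a candle are simple arithmetic on the four prices, which a small Ruby struct can show. `Candle` and its method names are my own, picked for illustration, and the prices in the usage line are made up.

```ruby
# One day's prices for a security (all names here are illustrative).
Candle = Struct.new(:open, :high, :low, :close) do
  # Did the security close above its opening price?
  def bullish?
    close > open
  end

  # Height of the wide part (the real body).
  def body
    (close - open).abs
  end

  # Shadow above the body: from the top of the body up to the day's high.
  def upper_shadow
    high - [open, close].max
  end

  # Shadow below the body: from the bottom of the body down to the day's low.
  def lower_shadow
    [open, close].min - low
  end
end

candle = Candle.new(100, 112, 97, 108)
puts "bullish=#{candle.bullish?} body=#{candle.body} upper=#{candle.upper_shadow} lower=#{candle.lower_shadow}"
```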