Is ggplot from yhat an inactive project

Recently, many data analysis platform products have emerged in China, such as Magic Mirror and Data View.

The goal of these products appears to be self-service BI: using visualization to provide data exploration capabilities, plus machine learning and prediction. Their benchmark products are presumably Tableau or SAP Lumira. Since I once developed data visualization features for Lumira, I was very interested in this area. I tried these products and found that there still seems to be a large gap between them and those benchmarks. So I decided to build a simple data analysis platform using open source software and try it out.

The code is here https://github.com/gangtao/dataplay2

Without further ado, here is the architecture diagram:

The most important open source software used:

Server side:

Client:

Development and build tools:

  • nodejs https://nodejs.org/en/

    This one needs no introduction.

  • babel https://babeljs.io/

    A JavaScript compiler that converts ES6 code into code that browsers can run. Here it is mainly used to compile the JSX used by Reactjs.

Having listed all this open source software, let's look at the features of dataplay2, and along the way at the role each piece of software plays and why I chose it.

Before we get to the subject, let's talk about the name dataplay2. Dataplay is easy to understand: I hope to create a simple and easy-to-use data platform that is as enjoyable to use as playing a game. But why the 2? Because this software is so different? Of course not. Actually, I had already written a dataplay, and its architecture was a little different. To use ggplot in R for a grammar-driven visualization solution, I used an R / Python bridge in the back end, and visualization actions in the front end generated ggplot commands. The advantage was a unified data model and grammar for controlling the visual analysis of data, which made it convenient for users to explore the data. However, that architecture was too complicated, with both R and Python on the server side. I could not put up with it and gave up. The new dataplay2 uses the ECharts chart library for visualization, and we will talk about its pros and cons later.

Running dataplay2 is very easy. After downloading the code from GitHub, it is recommended to install Anaconda so that all Python dependencies are available. Then enter the dataplay2/package directory and run:

One additional note: since the JSX used by React needs to be compiled, you need to compile the JSX files with Babel. The specific commands are as follows:

In addition, you must install all of the client's dependencies using bower:

The required dependencies are also listed in package/static/package.json. A simpler build script should be added at some point to handle these steps. The generated JS files are placed in the lib directory. When you change a source file in the js directory, Babel triggers compilation and generates a new JS file in the lib directory.

Then enter localhost:5000 in the browser to start the client.

First, open the Data menu.

This page allows users to browse existing data or upload a CSV file to add new data.

Here is a brief description of how this part is implemented.

The file input control is used for uploading data and the datatable control is used for the data table. For convenience, CSV files are stored directly in the local file system. In the back end, pandas is used to process the CSV files. The front end reads the CSV file through the REST API, parses it with Papaparse, and displays it in the data table. This was done purely for convenience: I built the entire POC in 3 to 4 days while on vacation, so I took whichever route was easiest. A better approach would be to parse the CSV file with Python in the back end.

Note that there are strict requirements for the uploaded CSV file, which must contain a header in the first line and no blank lines at the end.
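As a rough illustration of the back-end parsing mentioned above, here is a minimal sketch that checks the two stated requirements and then parses the file. The function name `parse_uploaded_csv` is hypothetical and not taken from the dataplay2 code, and the sketch uses Python's standard `csv` module rather than pandas so that it has no dependencies. It can only check that the first line is non-empty, not that it really is a header.

```python
import csv
import io


def parse_uploaded_csv(text):
    """Validate an uploaded CSV string and return (header, rows).

    Hypothetical helper: it enforces a non-empty first line (the header)
    and rejects blank lines at the end of the file.
    """
    lines = text.splitlines()
    if not lines or not lines[0].strip():
        raise ValueError("CSV must contain a header in the first line")
    if not lines[-1].strip():
        raise ValueError("CSV must not end with blank lines")

    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Each data row becomes a dict keyed by the header names.
    rows = [dict(zip(header, row)) for row in reader]
    return header, rows


if __name__ == "__main__":
    header, rows = parse_uploaded_csv("name,value\na,1\nb,2\n")
    print(header, rows)
```

With something like this on the server, the REST API could return already-parsed rows instead of the raw file.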

Once you have the data, you can start analyzing. First, let's look at visual analysis. Click the Analysis / Visualization menu.

For example, let's select the iris data source and create a scatter plot.

The main work in visualization is transforming the tabular data from the CSV into the data structure that ECharts expects, according to the data binding. Since ECharts does not have a unified data model, each chart type needs its own data transformation logic (see the code in package/static/js/visualization).
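To make the per-chart transformation concrete, here is a minimal sketch of two such transforms. In dataplay2 this logic lives in JavaScript on the client; Python is used here only for illustration, and the function names and the field names passed to them are hypothetical. The output dicts follow the general shape of an ECharts option (a `series` list whose `data` is a list of `[x, y]` pairs for scatter and a list of `name`/`value` items for pie).

```python
def to_scatter_option(rows, x_field, y_field, group_field=None):
    """Turn tabular rows into an ECharts-style scatter option dict."""
    groups = {}
    for row in rows:
        # One series per distinct value of the grouping column, if any.
        name = row[group_field] if group_field else y_field
        point = [float(row[x_field]), float(row[y_field])]
        groups.setdefault(name, []).append(point)
    return {
        "xAxis": {"type": "value", "name": x_field},
        "yAxis": {"type": "value", "name": y_field},
        "legend": {"data": list(groups)},
        "series": [
            {"name": name, "type": "scatter", "data": points}
            for name, points in groups.items()
        ],
    }


def to_pie_option(rows, name_field, value_field):
    """Pie charts need a different shape: a list of name/value items."""
    return {
        "series": [{
            "type": "pie",
            "data": [
                {"name": row[name_field], "value": float(row[value_field])}
                for row in rows
            ],
        }]
    }
```

The two functions share no code, which is the extension problem described above: every new chart type needs another transform of this kind.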

So far I have mainly implemented pie, bar, line, treemap, scatter, and area charts.

After using ECharts, I feel that its advantages and disadvantages are both obvious. The auxiliary features it provides are very good: you can easily add guide lines and annotations and save charts as images. However, because it lacks a unified data model, it is difficult to extend. I hope to find time to try plotly. Highcharts, of course, is without doubt a very mature chart library.

In fact, I was hoping to find a D3 implementation of ggplot, like this one: http://benjh33.github.io/ggd3/. Unfortunately, the project seems to be inactive.

In addition to visualization-based analysis functions, there are also functions for machine learning.

Classification

For classification, you can choose KNN, Bayes, or SVM.

When two features are selected for prediction, I use D3 to draw the predictive model. With more than two features, there is no way to draw it.

The user can then make predictions based on the model.
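As an illustration of the idea, and not the code dataplay2 actually runs, here is a minimal pure-Python sketch of KNN prediction on two features, plus a grid of predictions that a front end could colour to draw the model. The names `knn_predict` and `decision_grid` are hypothetical. It requires Python 3.8 or later for `math.dist`.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest neighbours."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def decision_grid(train_points, train_labels, xs, ys, k=3):
    """Predict a label for every (x, y) cell; one inner list per y value."""
    return [
        [knn_predict(train_points, train_labels, (x, y), k) for x in xs]
        for y in ys
    ]
```

A two-dimensional grid like this only exists when exactly two features are chosen, which is why the model cannot be drawn for more features.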

The clustering and regression functions work in basically the same way as classification.

Clustering

For clustering, K-means is currently implemented.
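For readers unfamiliar with the algorithm, here is a minimal pure-Python sketch of K-means. It is not the implementation used in dataplay2. The function name `kmeans` is hypothetical, and taking the first k points as the initial centroids is an assumption made only to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-means; returns (centroids, assignments)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments
```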

Linear regression

Logistic regression

 

That's all for the basic functions. Here are some features I would like to add:

  • Data Source

    Currently the only data source is CSV files. Support for more data sources could be considered, such as databases / data warehouses, REST calls, streams, etc.

  • Data model

    The current data model is relatively simple: a pandas data frame, or in other words a simple CSV table structure. Introducing a database is worth considering. Support for hierarchical data also needs to be added.

  • Data wrangling

    Data wrangling is a necessary preparation step for data analysis. There are many products in the industry that focus on data preparation, such as Paxata and Trifacta.

    This version of dataplay does not have any functions for data wrangling and preparation. In fact, pandas has very extensive data wrangling capabilities. I hope to put a DSL on top of it so that users can prepare data quickly.

  • Visualization library

    Baidu's ECharts is an excellent visualization library, but it is not good enough for data exploration. I hope to have a front-end visualization library similar to ggplot. In addition, maps and hierarchical charts are common features in data analysis.

    Configuration options for the charts also need to be added.

  • Dashboard function

    This version of dataplay does not have a dashboard function. Dashboards are a standard feature of data analysis software and must be present. Pyxley seems like a good choice and fits dataplay's architecture (Python, Reactjs). I may try it when I have the time.

  • Machine learning and prediction

    Dataplay currently implements only some of the simplest machine learning algorithms. I think the direction should be more user-centric and simpler: the user just provides simple options, such as the target attribute to be predicted and the attributes used for the prediction, and the algorithm is then selected automatically. In addition, it should be easier to add new algorithms.
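One possible starting point for the automatic selection described in the last item is to guess the kind of task from the target attribute. The sketch below is only a heuristic under stated assumptions, not an existing dataplay2 feature: the function name `choose_task` and the `max_classes` threshold are hypothetical.

```python
def choose_task(target_values, max_classes=10):
    """Guess which family of algorithms fits the chosen target attribute.

    Heuristic assumptions: no target means clustering, a non-numeric
    target means classification, and a numeric target with many distinct
    values means regression.
    """
    if target_values is None:
        return "clustering"
    try:
        numeric = [float(v) for v in target_values]
    except (TypeError, ValueError):
        return "classification"
    # Few distinct numeric values are treated as class labels.
    if len(set(numeric)) <= max_classes:
        return "classification"
    return "regression"
```

The returned task name could then be mapped to a default algorithm, for example K-means for clustering or KNN for classification.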

Finally, a few brief thoughts:

  • Reactjs is really good. I never liked MVC. The component model of Reactjs is more convenient to use, and development efficiency is really high. I completed the entire project in 3 to 4 days while on vacation, and React deserves much of the credit for that.

  • The current functionality of dataplay is still relatively weak, but the basic structure has been established. You can extend it if you want. I may not have time to improve it further, but anyone is welcome to discuss it with me.

Update:

Since many readers reported that it would not run properly, I created a Dockerfile. For build instructions, see https://github.com/gangtao/dataplay2/tree/master/docker. I hope this solves the problem for those who could not get it running.
The image has been published on Docker Hub: https://hub.docker.com/r/naughtytao/dataplay/