Commit 961b64a
add posts
1 parent ff08dd2 commit 961b64a

10 files changed: +145 additions, -4 deletions
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
---
layout: post
title: Data Science – A Money Making Machine?
bigimg: /img/bitcoin.png
#gh-repo: SubmitCode/Comparison-Of-Algorithms
#gh-badge: [star, fork, follow]
published: true
tags: [AI]
comments: true
---
**Imagine you are sitting at home and a piece of software is running somewhere on a cloud that makes money for you – 24 hours a day, 7 days a week, always at full capacity and never sick. The idea is appealing. Is something like this possible? In the media, we constantly hear of high-frequency traders who win millions by trading within a very short period of time. Is it possible for a team of young, enthusiastic data scientists to create a profitable trading algo in their free time? In this short post, we would like to present our data science journey from the idea to the implementation.**

We live in a wonderful time of digitalization right now. For us data scientists, this means an abundance of data. It has never been as easy as today to get your hands on data – a true land of plenty for a data scientist. To implement an algorithm, we first need the “basic feed” of any data scientist – data, data, and more data. This is why we first collected data from various crypto exchanges. Why crypto exchanges? Simply because there is no other place where you can get such good-quality data for free. Exchanges such as GDAX, Gemini and Bitstamp even offer websocket feeds and publish the entire order book. By contrast, comparable data from well-known exchanges, such as the New York Stock Exchange or Xetra, is unaffordable for a private investor. Since the quantities of data are sometimes huge (for example, the GDAX exchange produces several hundred messages per second), we decided to use Splunk. The advantage of Splunk is that it is very easily scalable and lets us store and analyze differently structured and formatted data with little effort. In addition, we could easily set up alerts and create dashboards, which was a plus for later operation.
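
For illustration, a minimal sketch of consuming such a feed could look like the one below. The endpoint URL and message fields follow the style of the (now legacy) GDAX/Coinbase Pro ticker channel and are assumptions here, not our production setup; in our system the messages were forwarded to Splunk instead of printed.

```python
# Minimal sketch: subscribe to a public ticker websocket feed and print updates.
# The URL and message layout are assumptions in the style of the legacy
# GDAX/Coinbase Pro feed -- check the exchange documentation before use.
import asyncio
import json

import websockets  # pip install websockets

FEED_URL = "wss://ws-feed.pro.coinbase.com"  # hypothetical/legacy endpoint


async def stream_ticker(product_id: str = "BTC-USD") -> None:
    subscribe_msg = {
        "type": "subscribe",
        "product_ids": [product_id],
        "channels": ["ticker"],
    }
    async with websockets.connect(FEED_URL) as ws:
        await ws.send(json.dumps(subscribe_msg))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "ticker":
                # In our setup, messages like this went to Splunk
                # (e.g. via the HTTP Event Collector) rather than stdout.
                print(msg.get("time"), msg.get("product_id"), msg.get("price"))


if __name__ == "__main__":
    asyncio.run(stream_ticker())
```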

After an initial analysis, we detected clear price differences among the various exchanges, which we thought we could benefit from. So we analyzed the order books in detail. However, after deducting the trading fees, we were left with next to nothing. Of course, we did not want to give up at this point. We continued our research and loaded more data from other exchanges into our system. We also tried to get better prices through “intelligent” placement of limit orders. Lo and behold: we found a strategy that is profitable, at least “on paper”. Now came the tricky part. We had to write code that would trade as fast as possible and simultaneously on several exchanges. We decided to use Python and a service-oriented architecture with a message queue (ZeroMQ) for this.
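
As a rough illustration of that architecture, the sketch below shows one service publishing best bid/ask quotes over ZeroMQ and a helper that checks whether a cross-exchange spread still covers the fees. The topic name, fee rate, prices and message layout are made-up placeholders, not our actual trading logic.

```python
# Sketch of the ZeroMQ-based layout: a quote publisher plus a spread check.
# In the real system, publisher and subscribers run as separate services.
import json

import zmq  # pip install pyzmq

FEE_RATE = 0.0025  # assumed 0.25% taker fee per side (placeholder)


def publish_quote(pub_socket: "zmq.Socket", exchange: str, bid: float, ask: float) -> None:
    """Publish a best bid/ask quote on the 'quotes' topic."""
    payload = {"exchange": exchange, "bid": bid, "ask": ask}
    pub_socket.send_string("quotes " + json.dumps(payload))


def is_profitable(buy_ask: float, sell_bid: float) -> bool:
    """True if buying on one exchange and selling on another beats the fees."""
    gross_edge = (sell_bid - buy_ask) / buy_ask
    return gross_edge > 2 * FEE_RATE


if __name__ == "__main__":
    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.bind("tcp://127.0.0.1:5555")

    # A consumer service would connect like this and read with sub.recv_string().
    sub = ctx.socket(zmq.SUB)
    sub.connect("tcp://127.0.0.1:5555")
    sub.setsockopt_string(zmq.SUBSCRIBE, "quotes")

    publish_quote(pub, "exchange_a", bid=9985.0, ask=9990.0)
    # Buy at 9,990 on A, sell at 10,060 on B: ~0.7% edge vs. 0.5% round-trip fees.
    print(is_profitable(buy_ask=9990.0, sell_bid=10060.0))  # True
```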

One of the challenges of bitcoin arbitrage and similar systems is that, on the one hand, the bitcoins have to be transferred from one exchange to another, and, on the other hand, the money has to be transferred from one exchange to another. Modern APIs provide this option, but sadly not all of the exchanges we wanted to trade on offer it. As a result, our algorithm still needed people to make the transfers. We found a solution for this as well, though: with the help of our “new” robotics team, we were able to automate this step using UiPath. This way, the algorithm worked 100% automatically, without manual intervention.

You may be wondering how much profit such a strategy can bring. There is no blanket answer to this, and the following factors need to be considered in order to make an estimate:

- What is the daily payout limit of the exchange?
- How long does it take to transfer the fiat money from one trading account to another?
- Do I hedge the physical bitcoins with a short futures position? Keep in mind: You get about USD 40,000 in margin per bitcoin that you short.

The example shown here illustrates very well that data science is a link between many fields. In the beginning, data analysis knowledge was needed. Later, it was important to develop software and set up the right IT infrastructure. And last but not least, it was and is important to know the exact regulatory requirements and understand payment transactions. This interdisciplinary focus is precisely the strength of the Inventx Data Science Team, comprising computer scientists, mathematicians, technical experts and experienced project managers.
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
---
layout: post
title: Real-World AI for Finding Horse Poo!
#bigimg: /img/PooDetector.gif
#gh-repo: SubmitCode/Comparison-Of-Algorithms
#gh-badge: [star, fork, follow]
published: true
tags: [AI]
comments: true
---

![image](/img/PooDetector.gif)

My wife has a very time-consuming hobby: she has three horses, Fabiola, Herkules and Sophie. As a good husband, I, of course, help her wherever I can, but with my data science background, I’m limited in the things I can do. For instance, I cannot fix the horse panels or construct barns. But finally, my time has come: I found a use case where I can help my wife save time and money with my data science skills.

In this article, I describe my venture of building an AI that recognizes when a horse poos in the right spot and gives the horse a treat, so that the horse learns to poo in the right place. It was quite challenging, and I learned a lot while working on the project. But I can show that AI is profitable and capable of things we couldn’t have imagined before.

A side note: I call it a real-world problem because many of the data science challenges on Kaggle, OpenAI, or elsewhere have a well-defined problem with engineered data. They do 90% of the work for you. In this case, I had to come up with an idea for every aspect of the project.

# Challenge
The most time-consuming part of owning a horse is the cleaning. One should clean up after their horses at least twice a day. If you have ever cleaned a horse box, you know that there is at least one wheelbarrow (German: Schubkarre) of horse poo per day. In our case, it usually takes my wife 30 to 45 minutes to clean the horse boxes, including the sprinkling of sawdust (German: einstreuen von Sägespänen).

To solve this problem, we came up with an idea that would save us a tremendous amount of time: if we could only train our horses to always poo in the same place, ideally right in front of the manure heap (German: Misthaufen)! Yet training a horse in the traditional way would be quite time-consuming. One of us would have to stay in the barn to watch the horses and, whenever a horse pooed in the right spot, give it a treat. This would need to be done for several days, and even then it would not be certain that the horse associates the treat with the act of pooing in the right place rather than with our presence or some other factor.

So we came up with the idea of building a robot to do this for us. The idea is quite simple: we place an IP camera (with infrared) at the spot where the horses should poo, train some neural nets to detect when a horse is pooing, and whenever pooing is detected, a machine gives the horse a treat.
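
A minimal sketch of that loop might look like the following. The RTSP URL, the `is_pooing` classifier and the `dispense_treat` routine are hypothetical stand-ins for the real components, shown only to outline the control flow.

```python
# Sketch of the detection loop: grab roughly one frame per second from the
# IP camera, classify it, and trigger the feeder when pooing is detected.
# Camera URL, classifier and feeder hook are assumptions for illustration.
import time

import cv2  # pip install opencv-python

CAMERA_URL = "rtsp://192.168.1.50:554/stream"  # hypothetical camera address


def is_pooing(frame) -> bool:
    """Placeholder for the trained classifier (e.g. the ResNet-50 described below)."""
    return False  # replace with a real model prediction


def dispense_treat() -> None:
    """Placeholder for the feeding machine, e.g. a relay on a Raspberry Pi GPIO pin."""
    print("treat dispensed")


def main() -> None:
    cap = cv2.VideoCapture(CAMERA_URL)
    last_reward = 0.0
    while True:
        ok, frame = cap.read()
        if not ok:
            time.sleep(1.0)
            continue
        if is_pooing(frame) and time.time() - last_reward > 60:
            dispense_treat()        # reward at most once per minute
            last_reward = time.time()
        time.sleep(1.0)             # ~1 frame per second is enough


if __name__ == "__main__":
    main()
```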

That being said, it’s actually quite challenging. But this is not because it’s hard to train a neural net or choose the right architecture. Below is a list of the main real-world challenges we encountered:

- Horse pooing takes about 20 seconds, so in order to get enough pictures to get started, you need to capture at least one frame per second – and even then, you might not capture the moment when the poo falls on the ground. Also, we only installed one camera for detecting poo, so the horse has to poo in the right spot by chance. This means we got 86,400 pics per day for, let’s say, 60 pooing pictures – and initially, we had to go through them by hand and label them.
- We tried to label the pictures based on the action that was performed. Typically, one would use a 3D convolutional neural network or a recurrent neural network for this. For now, I settled on a ResNet-50 (a 50-layer residual neural network); a transfer-learning sketch follows after this list.
- As mentioned, we have three horses: one big mare and two miniature Shetland ponies, a stallion and a mare, and their pooing pictures actually look quite different. The pooing pose of the small horses is so difficult to capture – Herkules and Sophie have very long tails with lots of hair – that we focused mainly on Fabiola, the big horse.
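
For the ResNet-50 mentioned above, a transfer-learning setup along the following lines could be used. The folder layout, batch size and training loop are simplified assumptions, not the exact code I ran.

```python
# Sketch: fine-tune a pretrained ResNet-50 as a binary pooing/not-pooing
# classifier with torchvision. Data paths and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Expects data/train/pooing and data/train/not_pooing folders (hypothetical layout).
train_set = datasets.ImageFolder("data/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)

model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: pooing / not pooing

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # train only the new head
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:          # one epoch shown; repeat as needed
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```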

# Does it work?
Yes, but there is still a lot of work to do. We still have the problem of false positives. For instance, it’s hard to catch the pooing action if the horse stands at a certain angle to the camera, and depending on the weather conditions it’s sometimes hard to distinguish between pooing and peeing.

# How it works
![image](/img/solution_outline.jpg)

# Below is a picture of the feeding robot
![image](/img/feeding_robot.gif)

# Is it a real-world application?
Well, if you do projects professionally, you have to calculate the business case. So here we go. Let’s assume my wife saves about 30 minutes a day cleaning the horse boxes. In Switzerland, the median gross income is about 6,500 CHF per month. For the sake of argument, I assume that the net income is 33% lower than the gross income, which gives an hourly rate of roughly 23 CHF. The average month has about 30 days, which means she spends 15 hours per month, or 180 hours per year, cleaning the horse boxes. This gives us a saving of around 4,140 CHF (roughly 4,100 USD) per year, without considering the running costs.
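
As a sanity check, that back-of-the-envelope calculation can be written out as follows; the income and working-hour figures are the rough assumptions stated above.

```python
# Back-of-the-envelope check of the yearly saving (figures as assumed above).
gross_per_month = 6500                          # CHF, Swiss median gross income (approx.)
net_per_month = gross_per_month * (1 - 0.33)    # assume net is 33% lower -> ~4,355 CHF
work_hours_per_month = 42 * 52 / 12             # ~182 h full-time (assumption)
hourly_rate = net_per_month / work_hours_per_month  # ~24 CHF/h; the post rounds to 23

hours_saved_per_year = 0.5 * 30 * 12            # 30 min/day over ~360 days = 180 h
yearly_saving = hours_saved_per_year * 23       # ~4,140 CHF before running costs
print(round(hourly_rate, 1), hours_saved_per_year, yearly_saving)
```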

## Cost of the infrastructure
- IP cam: 300 CHF
- LTE router: 150 CHF
- Raspberry Pi + equipment: 100 CHF
- Setup work (it takes a couple of days initially, but I expect the second project will go much faster): I will assume five days, which translates to 2,000 CHF (I assume a higher hourly rate for IT work)
- Electric motor and parts for the feeding mechanism: 100 CHF

## Running costs
- Google Cloud Linux VM: 95 CHF
- GPU lease for about 16 hours per month: 7 CHF
- Internet: 35 CHF
- Work per month for labeling pics and maintenance: 50 CHF

All of the above gives us a monthly saving of around 158 CHF and an initial cost of around 2,650 CHF. This gives us a payback period of roughly 17 months and around 1,900 CHF of savings per year.
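
The payback calculation itself is just the following arithmetic over the numbers listed above.

```python
# Payback period from the cost figures listed above (all amounts in CHF).
setup_costs = 300 + 150 + 100 + 2000 + 100   # cam, router, Pi, setup work, feeder = 2,650
running_costs_per_month = 95 + 7 + 35 + 50   # VM, GPU lease, internet, labeling = 187
gross_saving_per_month = 4140 / 12           # ~345, from the yearly saving above

net_saving_per_month = gross_saving_per_month - running_costs_per_month  # ~158
payback_months = setup_costs / net_saving_per_month                      # ~17 months
yearly_net_saving = net_saving_per_month * 12                            # ~1,900
print(round(net_saving_per_month), round(payback_months, 1), round(yearly_net_saving))
```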

# Summary & next steps
I have to admit, this is a special use case. But it smells like a business idea, doesn’t it? Take computer vision, add AI, and solve a real-world problem. Just think of other use cases, such as:

- Do you, like most of us, hate it when you have to wait in line to pay for your groceries while the other checkouts are closed? A store could train an AI to detect long lines and automatically call a cashier.
- The store could also track the performance of its salespeople. For instance, as an owner or manager, you could measure how long it takes on average until a customer is approached when a salesperson is in the showroom.
Right now I am playing around with the RetinaNet, YOLOv2 and Faster R-CNN architectures. In plain English: I draw bounding boxes around the horse to tell the AI which area of the picture is the interesting one.
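
As an illustration of that next step, a COCO-pretrained Faster R-CNN from torchvision can already produce bounding boxes out of the box (COCO even includes a horse class); fine-tuning on my own boxes follows the same pattern. The snippet below is a small inference sketch with an assumed file name, not my final pipeline.

```python
# Sketch: run a COCO-pretrained Faster R-CNN on one camera frame and keep confident boxes.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("frame.jpg").convert("RGB")   # hypothetical camera frame
tensor = transforms.ToTensor()(image)

with torch.no_grad():
    prediction = model([tensor])[0]   # dict with 'boxes', 'labels', 'scores'

for box, label, score in zip(prediction["boxes"], prediction["labels"], prediction["scores"]):
    if score > 0.8:                   # keep only confident detections
        print(int(label), [round(float(v), 1) for v in box], round(float(score), 2))
```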
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
---
layout: post
title: Deep learning for tabular data?
bigimg: /img/tabularData.jpeg
gh-repo: SubmitCode/Comparison-Of-Algorithms
gh-badge: [star, fork, follow]
published: true
tags: [AI]
comments: true
---

In spite of all the hype about deep learning, in my experience over 95% of the time we must deal with plain and simple csv files – in other words, tabular data. So the question for me is, **Can we use deep learning for tabular data?**

In order to answer this question, I used the Santander [Customer Satisfaction Dataset](https://www.kaggle.com/c/santander-customer-satisfaction/data) from [Kaggle](https://www.kaggle.com/). I chose this set because it’s much like the datasets our customers usually work with. And it appears that Santander is trying to do churn prediction, which is always a hot topic.

## Why consider deep learning anyway?
There is this famous graph, shown below, which suggests that the more data you throw at a deep learning model, the more it improves, whereas traditional ML models hit a plateau.
![image](/img/amount_vs_performance.png)

## Data
Our test data has 370 features and around 76,000 data points. We tried to predict whether a customer is happy or unhappy. If you look at the data, you will see that around **4%** of the customers are unhappy.
I did not do any data pre-processing other than feature scaling (meaning that all values fall into the same range). Omitting these extra steps made it easier for most of the algorithms to make predictions.

## Test Setup
The test setup is quite straightforward. We took the training data file, scaled its features, and ran it through a selection of algorithms. As the objective function we used the official Kaggle challenge metric, which is the area under the ROC (receiver operating characteristic) curve.
To test the setup properly, we used the mean of the cross-validation results.
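
A minimal version of this setup looks roughly like the following; the file name and the choice of gradient boosting as the example model are assumptions for illustration, and each algorithm in the table below was evaluated the same way.

```python
# Sketch of the test setup: scale features and report the mean cross-validated ROC AUC.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")            # Kaggle Santander training file (assumed name)
X = df.drop(columns=["ID", "TARGET"])    # 370 anonymised features
y = df["TARGET"]                         # 1 = unhappy customer (~4% of rows)

model = make_pipeline(StandardScaler(), GradientBoostingClassifier())
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean())                     # mean AUC over the folds
```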

## Results

| Algo         | AUC score |
|--------------|-----------|
| tree         | 0.573     |
| extra_tree   | 0.644     |
| forest       | 0.685     |
| ada_boost    | 0.826     |
| bagging      | 0.701     |
| grad_boost   | 0.835     |
| ridge        | 0.791     |
| passive      | 0.701     |
| sgd          | 0.646     |
| gaussian     | 0.523     |
| xgboost      | 0.838     |
| deeplearning | 0.770     |

## Summary
The table above shows quite clearly that ensemble techniques still perform much better than deep learning out of the box. Another good reason not to choose deep learning is that it’s computationally intensive. Despite these shortcomings, there are still a few good reasons to choose deep learning over traditional methods:
- **Combine different types of data:** You can combine pictures, free text, and tabular data within one deep learning model, something that can be quite difficult for other ML algorithms.
- **Big data:** Your tabular data may eventually become big data. When that happens, your deep learning model will keep on improving while the traditional models will have hit a plateau.
- **Less feature engineering needed:** Traditional ML algorithms usually require a lot of domain-specific knowledge. With deep learning, it’s possible to get great results without much feature engineering.

_posts/2020-12-20-is-your-company-ready-for-ai.md renamed to _posts/2019-11-06-is-your-company-ready-for-ai.md

Lines changed: 4 additions & 4 deletions
@@ -13,16 +13,16 @@ Therefore, here are my five predictions for the next two years with small and me
 ## It’s still all about data
 Companies will move away from the traditional data warehouse approach to data lakes with logical data warehouses. Many companies we work with still underutilise their data simply because they cannot keep pace with the variety of new data sources. Traditional data warehouses are kind of slow when it comes to onboarding data from many different sources in an integrated way. That’s why there’s a data lake in the first place, but there’s still a need for a common interface to the data—especially in terms of compliance, accessibility and security. Logical data warehouses offer a solution to this problem. They let you access all the data from different sources with a common interface and allow onboarding of new data in a fast and easy way. Other than this, we’ll see more data products which will help us enhance the existing data.

-# AI and ML off the shelf
+## AI and ML off the shelf
 We’re going to see more products, such as Azure cognitive services, which will help us implement state-of-the art AI and ML algorithms without the need for in-depth knowledge. Furthermore, many vendors will implement AI in their products to improve the usability of these products. Data scientists’ roles will therefore change, as they become even more like integrators and act as a bridge between departments and vendors.

-# Devops for data science
+## Devops for data science
 Many companies I work with tend to do proofs of concepts, and some of them are successful, but ultimately, they fail to put these proofs of concepts into production. Therefore, devops will evolve in this space.

-# Data strategy – data offense vs. defence
+## Data strategy – data offense vs. defence
 The companies we work with, including Inventx, are very good at defining data defence strategies. We care greatly for our data, as well as our customers’ data. We have many processes in place to protect data. This will change, as companies will also define their strategies to monetise and utilise their data more instead of just protecting data.

-# Data democratisation and self-service
+## Data democratisation and self-service
 We already see more and more of self-service. This trend will develop further. With the rise of Power BI and similar tools, workers will be able to apply simple ML and AI algorithms themselves. Ultimately, co-workers will come to you with ideas and use cases if they know their data.

 All in all, I believe that it will be crucial for SMEs within the next two years to develop a sound data strategy to be ready for AI.

img/PooDetector.gif (14.3 MB)
img/amount_vs_performance.png (22.7 KB)
img/bitcoin.png (1.11 MB)
img/feeding_robot.gif (23.4 MB)
img/solution_outline.jpg (95.8 KB)
img/tabularData.jpeg (128 KB)
