Visualize Thread by @mdancho84 | Thread Navigator

✨ Visual Editor

palette Canvas & Background

Presets

Custom Colors

Gradient:arrow_forward

Text Color:

Gradient Angle135°

Background Pattern

Grain Texture

Aspect Ratio

style Card Style

Preset

Padding40px

Card Radius16px

Enable Card Shadow

Glassmorphism Effect

Show Watermark AGENCY

Show Timestamps

Show X Logo

text_fields Typography

Font Family

Font Size16px

Matt Dancho (Business Science)

@mdancho84

Understanding probability is essential in data science.

In 4 minutes, I'll demolish your confusion.

Let's go!

Thread image

Matt Dancho (Business Science)

@mdancho84

1. Statistical Distributions:

There are 100s of distributions to choose from when modeling data. Choices seem endless. Use this as a guide to simplify the choice.

Thread image

Matt Dancho (Business Science)

@mdancho84

2. Discrete Distributions:

Discrete distributions are used when the data can take on only specific, distinct values. These values are often integers, like the number of sales calls made or the number of customers that converted.

Matt Dancho (Business Science)

@mdancho84

3. Continuous distributions:

Used for data that can take on any value within a range or interval. These values are typically real numbers, like the percentage of visitors that converted or the forecasted revenue over the next 6 months.

Matt Dancho (Business Science)

@mdancho84

4. Probability Mass Function (Discrete):

Discrete distributions are described by a probability mass function, which gives the probability that a discrete random variable is exactly equal to some value. In a graph, a discrete distribution is often represented by a series of bars, where each bar represents the probability of each discrete outcome.

Thread image

Matt Dancho (Business Science)

@mdancho84

5. Probability Density Function (Continuous):

Continuous distributions are described by a probability density function. The probability of the variable falling within a particular range is given by the area under the curve of the PDF within that range. In a graph, a continuous distribution is usually represented by a smooth curve.

Thread image

Matt Dancho (Business Science)

@mdancho84

6. Parametric Models: Many models assume a specific distribution.

- Linear Regression: Assumes normally distributed errors.
- Logistic Regression: Assumes a binomial distribution of the response variable.

Matt Dancho (Business Science)

@mdancho84

7. Non-Parametric Models:

These models do not make strong assumptions about the form of the data distribution.

- Decision Trees
- K-Nearest Neighbors
- Support Vector Machines

Matt Dancho (Business Science)

@mdancho84

8. Loss Functions:

Distributions will come up in Loss Functions in Machine Learning (e.g. XGBoost, LightGBM, CatBoost). Selecting the right Loss Function can often improve performance.

Examples:

- Poisson is used for count data.
- Tweedie for mixed continuous data with many zeros like intermittent demand forecasting problems.

Thread image

Matt Dancho (Business Science)

@mdancho84

One more thing before you go.

A new problem has surfaced -- Companies NOW want AI.

Yet 99% of data scientists are ignoring it.

That's a huge advantage to you. I'd like to help.

Matt Dancho (Business Science)

@mdancho84

On Wednesday, June 21st, I'm sharing one of my best AI Projects:

How I built an AI Customer Segmentation Agent with Python:

Register here (limit 500 seats): learn.business-science.io/ai-register

Thread image

Matt Dancho (Business Science)

@mdancho84

That's a wrap! Over the next 24 days, I'm sharing the 24 concepts that helped me become an AI data scientist.

If you enjoyed this thread:

1. Follow me @mdancho84 for more of these
2. RT the tweet below to share this thread with your audience

View Tweet

Generated by Thread Navigator

100%

view_carousel Carousel Studio NEW

Press ⌘ + S to quick-export