CSV Sample Generator

Working with millions of rows but only need a representative slice? Extract a random sample of any size — by row count or percentage — with optional stratification to ensure balanced representation across key columns.

Extract random or stratified samples from CSV data using percentage-based or fixed-count sampling, with seed support for reproducibility.

Sampling methods:

  • Fixed Count: Extract exactly N rows from the dataset
  • Percentage: Extract a percentage of total rows
  • Seed: Use a seed value for reproducible random sampling
  • Stratified: Maintain proportional representation across groups in a column (useful for balanced class distribution)

What This Tool Does

This tool extracts a random sample of rows from your CSV file. Choose a fixed number of rows (e.g., 100 rows) or a percentage (e.g., 10% of all rows). Optional stratified sampling keeps the distribution of values in a chosen column the same in the sample as in the original data.

Sampling Options

Fixed count: Extract exactly N random rows. Useful when you need a specific sample size.

Percentage: Extract X% of all rows. Useful when you want a proportional sample.

Stratified sampling: Maintain the same distribution of values in a selected column. If 30% of your data is "Active", about 30% of the sample will be "Active" (group counts are whole rows, so small rounding differences are possible).

Seed value: Set a seed for reproducible random sampling. Same seed = same sample every time.
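
The fixed-count, percentage, and seed options can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's actual code: the function name sample_csv, the rounding of the percentage to the nearest whole row, and returning rows in their original file order are all assumptions.

```python
import csv
import io
import random

def sample_csv(text, count=None, percent=None, seed=None):
    """Return (header, rows) holding a simple random sample of the data rows.

    Pass either `count` (fixed number of rows) or `percent` (0-100).
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)
    if count is None:
        # Percentage mode: round to the nearest whole row (an assumption).
        count = round(len(rows) * percent / 100)
    # Never ask for more rows than exist.
    count = min(count, len(rows))
    rng = random.Random(seed)  # seed=None gives a different sample each run
    # Pick distinct row indices, then keep the original file order.
    picked = sorted(rng.sample(range(len(rows)), count))
    return header, [rows[i] for i in picked]

# Hypothetical 20-row input; 10% of 20 rows is 2 rows.
data = "id,name\n" + "\n".join(f"{i},row{i}" for i in range(1, 21))
header, rows = sample_csv(data, percent=10, seed=42)
print(header, len(rows))
```

Sorting the picked indices is a design choice made here so the output reads like a thinned-out copy of the input; a shuffled order would be just as valid a random sample.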

Example: Random Sample

Input CSV (1000 rows):

id,name,status,score
1,Alice,Active,85
2,Bob,Inactive,72
... (998 more rows)
1000,Zoe,Active,91

Sample: 10% (100 rows)

Output CSV (100 random rows):

id,name,status,score
23,Carol,Active,88
156,David,Inactive,65
... (98 more rows)
892,Eve,Active,79

Example: Stratified Sample

Input distribution:

Status distribution:
Active:   600 rows (60%)
Inactive: 300 rows (30%)
Pending:  100 rows (10%)

Stratified sample (100 rows):

Status distribution in sample:
Active:   60 rows (60%)
Inactive: 30 rows (30%)
Pending:  10 rows (10%)

The sample preserves the original distribution.
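
One way to get the counts above is proportional allocation: give each group its share of the sample size, rounded down, then hand any leftover rows to the groups with the largest remainders. The sketch below assumes that scheme and the hypothetical function name stratified_sample; the tool itself may round differently.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, size, seed=None):
    """Sample `size` rows, keeping each group's share close to its share of `rows`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    total = len(rows)
    size = min(size, total)
    # Integer quota per group (rounded down) and the remainder left over.
    alloc = {g: size * len(m) // total for g, m in groups.items()}
    remainders = {g: size * len(m) % total for g, m in groups.items()}
    # Give the leftover rows to the groups with the largest remainders.
    leftover = size - sum(alloc.values())
    for g in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        alloc[g] += 1

    rng = random.Random(seed)
    sample = []
    for g, members in groups.items():
        # Sample without replacement inside each group.
        sample.extend(rng.sample(members, alloc[g]))
    return sample

# Rebuild the 600 / 300 / 100 distribution from the example.
rows = ([{"status": "Active"}] * 600
        + [{"status": "Inactive"}] * 300
        + [{"status": "Pending"}] * 100)
sample = stratified_sample(rows, "status", 100, seed=42)
print(Counter(r["status"] for r in sample))  # 60 Active, 30 Inactive, 10 Pending
```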

When to Use This

Quick data exploration: Sample a large file to understand its structure before full processing.

Testing: Create smaller test datasets from production data for development environments.

Statistical analysis: Work with a manageable sample when the full dataset is too large.

Machine learning: Create training/test splits from your data.

Quality assurance: Randomly sample records for manual review or audit.

Simple Random vs Stratified

Simple random sampling: Every row has an equal chance of selection. Fast and simple, but may not represent rare categories well.

Stratified sampling: Ensures each group is proportionally represented. Better for analysis where group distribution matters.

Example: Fraud detection dataset
- 99% legitimate transactions
- 1% fraudulent transactions

Simple random sample of 100: Typically 0-2 fraud cases, and quite possibly none
Stratified sample of 100: Exactly 1 fraud case (1%)
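
To put rough numbers on the simple random case, the chance of seeing exactly k fraud rows can be approximated with the binomial distribution. This is a sketch that assumes the dataset is much larger than the sample, so that draws are close to independent; the function name prob_k_rare is made up for this illustration.

```python
from math import comb

def prob_k_rare(n, p, k):
    """Binomial probability of exactly k rare rows in a sample of n rows.

    Approximation: treats draws as independent, which is reasonable when
    the dataset is much larger than the sample.
    """
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Sample of 100 rows, 1% fraud rate.
for k in range(4):
    print(k, round(prob_k_rare(100, 0.01, k), 3))
```

Under this approximation, roughly 37% of simple random samples of 100 contain no fraud case at all, which is why stratification helps when the rare class matters.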

Sample Size Guidelines

For exploration: 100-1000 rows are usually sufficient to understand the structure.

For testing: Match your typical production batch size.

For analysis: Larger samples give more accurate results. 10% is common for large datasets.

For rare events: Use stratified sampling or ensure the sample is large enough to capture rare cases.

Reproducible Sampling

Use the seed option for reproducible samples:

Seed: 42 → Same 100 rows every time
Seed: (empty) → Different random sample each run

Useful for tests and analyses that need consistent data.
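
The same behavior is easy to see with Python's standard random module. This is only a demonstration of the principle: a seed fixes the sequence of one particular generator, so seed 42 here will not select the same rows as seed 42 in this tool or in any other implementation.

```python
import random

rows = list(range(1, 1001))  # stand-in for 1000 row ids

# Two generators created with the same seed pick the same 100 rows.
first = random.Random(42).sample(rows, 100)
second = random.Random(42).sample(rows, 100)
print(first == second)  # True

# No seed: the generator is seeded from system entropy, so the
# selection will almost certainly differ from run to run.
unseeded = random.Random().sample(rows, 100)
print(len(unseeded))
```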

Limitations

Large files: The entire file loads into memory. Files over 100MB may cause slow performance.
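
For files too large to load comfortably, one common workaround outside this tool is reservoir sampling, which reads the file once and keeps only the sample in memory. The sketch below uses the classic "Algorithm R"; it is an alternative technique, not what the tool does, and the name reservoir_sample is an assumption.

```python
import csv
import random

def reservoir_sample(lines, k, seed=None):
    """Uniform sample of k data rows from an iterable of CSV lines, in one pass."""
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace a slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return header, reservoir

# In-memory stand-in for a large file; an open file object works the same way.
lines = ["id,value"] + [f"{i},{i * i}" for i in range(1, 10001)]
header, rows = reservoir_sample(lines, 100, seed=42)
print(header, len(rows))
```

Memory use grows with the sample size rather than the file size. The trade-off is that a percentage-based sample needs the row count in advance, which means either a second pass or a known file length.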

Very small samples: Sampling 1 row from 1 million may not be truly random due to algorithm limitations.

Stratified with many groups: If the stratification column has many unique values, some groups may have too few rows for proper sampling.
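
A quick pre-check can show which groups are at risk before you stratify. The sketch below flags groups whose proportional share of the sample would be less than one row; the helper name underfilled_groups and the "less than one row" threshold are assumptions chosen for illustration.

```python
from collections import Counter

def underfilled_groups(values, sample_size):
    """Return {group: row_count} for groups whose proportional quota is below 1 row."""
    counts = Counter(values)
    total = sum(counts.values())
    # quota = sample_size * n / total; compare without floating point.
    return {g: n for g, n in counts.items() if sample_size * n < total}

# Hypothetical country column: 950 US, 45 FR, 5 NZ rows.
countries = ["US"] * 950 + ["FR"] * 45 + ["NZ"] * 5
print(underfilled_groups(countries, 100))  # {'NZ': 5}
```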

Frequently Asked Questions

Is the sampling truly random?

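The tool uses a seeded random number generator, so the sampling is pseudo-random: statistically random for practical purposes, but fully determined by the seed. Without a seed, each run produces different results. With a seed, results are reproducible.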

Can I sample without replacement?

Yes. Each row can only be selected once, so no input row appears twice in your sample. If the input itself contains identical rows, more than one of them can still be selected.
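
For comparison, here is the difference between the two modes using Python's standard library. This only illustrates the concept and says nothing about how the tool is implemented.

```python
import random

rng = random.Random(7)
rows = ["r1", "r2", "r3", "r4", "r5"]

# Without replacement: every picked row is a different input row.
without = rng.sample(rows, 3)

# With replacement: the same input row may be picked more than once.
with_repl = rng.choices(rows, k=3)

print(without, with_repl)
```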

What if I request more rows than exist?

The tool returns all available rows if you request more than exist. No error is thrown.
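
If you reproduce this behavior in your own scripts, note that some libraries are stricter: Python's random.sample raises ValueError when asked for more items than exist. A minimal sketch of the clamping described above, with the hypothetical helper name safe_sample:

```python
import random

def safe_sample(rows, n, seed=None):
    """Sample n rows without replacement, returning all rows if n is too large."""
    # Clamp the request to the range 0..len(rows).
    n = max(0, min(n, len(rows)))
    return random.Random(seed).sample(rows, n)

rows = list(range(10))
print(len(safe_sample(rows, 50, seed=1)))  # 10
```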