Strip BOM
Strip UTF-8 byte order mark (BOM) from strings, bytes, streams, and files. Inspired by the popular strip-bom npm package.
Features
- Multiple input types: Strip BOM from strings, bytes, bytearrays, streams, and files
- Smart validation: Validates UTF-8 encoding before processing buffers
- Memory efficient: Handles large files and streams without loading everything into memory
- Zero dependencies: Lightweight with no external dependencies
- Type safe: Full type hints for excellent IDE support
- Robust: Graceful error handling and Unicode support (emojis, CJK characters, etc.)
Why Strip BOM?
The UTF-8 Byte Order Mark (BOM) can cause issues when:
- ❌ Processing files from different sources (some have BOM, others don't)
- ❌ Comparing strings that should be identical but differ only by BOM
- ❌ Working with APIs that don't expect BOM characters
- ❌ Parsing JSON, CSV, or other structured data formats
Note: The Unicode Standard permits a BOM in UTF-8 but neither requires nor recommends it, since byte order is irrelevant for UTF-8.
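The problems listed above are easy to reproduce with the standard library alone. The following is a minimal sketch (it does not use strip-bom; the variable names are illustrative) showing that a BOM makes two otherwise identical strings compare unequal and that Python's `json` module rejects a string that starts with one.

```python
import json

BOM = "\ufeff"  # U+FEFF, the byte order mark as a Unicode code point

plain = '{"name": "unicorn"}'
with_bom = BOM + plain

# The two strings look the same when printed but are not equal.
same_text = plain == with_bom
extra_chars = len(with_bom) - len(plain)

# json.loads refuses input that begins with a BOM.
try:
    json.loads(with_bom)
    json_error = None
except json.JSONDecodeError as exc:
    json_error = str(exc)

# Dropping the leading BOM makes the payload parse again.
recovered = json.loads(with_bom[len(BOM):])

print(same_text)
print(extra_chars)
print(json_error)
print(recovered)
```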
Installation
pip install strip-bom
Usage Examples
Strings
from strip_bom import strip_bom
text_with_bom = '\ufeffunicorn'
clean_text = strip_bom(text_with_bom)
print(clean_text) # 'unicorn'
# Text without BOM remains unchanged
normal_text = 'Hello World'
print(strip_bom(normal_text)) # 'Hello World'
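For intuition, the string case can be approximated in a few lines of plain Python. This is only a sketch of the idea, not the package's actual implementation; `strip_leading_bom` is a hypothetical name, and the choice to touch only a BOM at the very start of the string is an assumption.

```python
BOM = "\ufeff"  # U+FEFF


def strip_leading_bom(text: str) -> str:
    """Hypothetical helper: drop one leading U+FEFF, leave everything else alone."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text


print(strip_leading_bom("\ufeffunicorn"))
print(strip_leading_bom("Hello World"))
# A U+FEFF in the middle of the text is not a BOM and is kept by this sketch.
print(len(strip_leading_bom("a\ufeffb")))
```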
Bytes and Buffers
from strip_bom import strip_bom_buffer
bytes_with_bom = b'\xef\xbb\xbfunicorn'
clean_bytes = strip_bom_buffer(bytes_with_bom)
print(clean_bytes) # b'unicorn'
# Invalid UTF-8 is left unchanged (safety first!)
invalid_utf8 = b'\xef\xbb\xbf\xff\xfe'
result = strip_bom_buffer(invalid_utf8)
print(result == invalid_utf8) # True (no changes made)
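The validate-then-strip behaviour shown above could be sketched roughly as follows. This is an assumption about how such a check might work, not the library's code; `strip_bom_if_valid_utf8` is a hypothetical name, and decoding the whole buffer to validate it is just the simplest approach.

```python
from typing import Union

UTF8_BOM = b"\xef\xbb\xbf"


def strip_bom_if_valid_utf8(data: Union[bytes, bytearray]) -> bytes:
    """Hypothetical helper: remove a leading UTF-8 BOM only from valid UTF-8 input."""
    raw = bytes(data)
    if not raw.startswith(UTF8_BOM):
        return raw
    try:
        # Validation step: decoding raises if the buffer is not valid UTF-8.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw  # leave suspicious input untouched
    return raw[len(UTF8_BOM):]


print(strip_bom_if_valid_utf8(b"\xef\xbb\xbfunicorn"))
print(strip_bom_if_valid_utf8(bytearray(b"\xef\xbb\xbfunicorn")))
invalid = b"\xef\xbb\xbf\xff\xfe"
print(strip_bom_if_valid_utf8(invalid) == invalid)
```

Validating before stripping means a binary blob that merely happens to start with the bytes `EF BB BF` is not silently altered.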
Streams (Memory Efficient)
from strip_bom import strip_bom_stream
import io
# Process large streams without loading everything into memory
stream = io.BytesIO(b'\xef\xbb\xbfLarge file content here...')
# Process in chunks
for chunk in strip_bom_stream(stream, chunk_size=8192):
    # Process each chunk as needed
    print(chunk)
# Or get all content at once
stream.seek(0)
content = b''.join(strip_bom_stream(stream))
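A chunked approach only needs to inspect the first three bytes of the stream, which is why memory use can stay flat regardless of input size. The generator below is a stdlib-only sketch of that idea, not the package's implementation; `iter_without_bom` is a hypothetical name, and it assumes that `stream.read(3)` returns three bytes whenever at least three are available (true for `io.BytesIO` and buffered files).

```python
import io
from typing import BinaryIO, Iterator

UTF8_BOM = b"\xef\xbb\xbf"


def iter_without_bom(stream: BinaryIO, chunk_size: int = 8192) -> Iterator[bytes]:
    """Hypothetical helper: yield the stream's content, skipping a leading UTF-8 BOM."""
    head = stream.read(len(UTF8_BOM))
    if head and head != UTF8_BOM:
        # No BOM: the bytes already read are real content and must be passed on.
        yield head
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


with_bom = io.BytesIO(b"\xef\xbb\xbfLarge file content here...")
print(b"".join(iter_without_bom(with_bom, chunk_size=8)))

without_bom = io.BytesIO(b"hello")
print(list(iter_without_bom(without_bom, chunk_size=2)))
```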
Files
from strip_bom import strip_bom_file
# Text mode (reads as UTF-8)
content = strip_bom_file('data.txt', mode='r')
print(f"File content: {content}")
# Binary mode
binary_content = strip_bom_file('data.txt', mode='rb')
print(f"Binary content: {binary_content}")
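To make the difference between the two modes concrete, here is a stdlib-only sketch that writes a small file with a BOM and reads it back in several ways. It does not call `strip_bom_file`; the temporary file and variable names are illustrative. It relies on Python's built-in `utf-8-sig` codec, which skips a leading BOM when decoding.

```python
import os
import tempfile

UTF8_BOM = b"\xef\xbb\xbf"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "wb") as fh:
        fh.write(UTF8_BOM + "unicorn 🦄".encode("utf-8"))

    # Plain UTF-8 text mode keeps the BOM as a leading U+FEFF character.
    with open(path, "r", encoding="utf-8") as fh:
        raw_text = fh.read()

    # The utf-8-sig codec drops a leading BOM while decoding.
    with open(path, "r", encoding="utf-8-sig") as fh:
        clean_text = fh.read()

    # Binary mode returns the BOM as three raw bytes.
    with open(path, "rb") as fh:
        raw_bytes = fh.read()

# Manual removal for the binary case.
clean_bytes = raw_bytes[len(UTF8_BOM):] if raw_bytes.startswith(UTF8_BOM) else raw_bytes

print(repr(raw_text[0]))
print(clean_text)
print(clean_bytes.decode("utf-8"))
```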
API Reference
strip_bom(text: str) -> str
Remove a leading BOM (U+FEFF) from a Unicode string. Strings without a BOM are returned unchanged.
strip_bom_buffer(buffer: Union[bytes, bytearray]) -> bytes
Remove the UTF-8 BOM from a bytes or bytearray object, but only if the content is valid UTF-8. Invalid UTF-8 is returned unchanged.
strip_bom_stream(stream: BinaryIO, chunk_size: int = 8192) -> Iterator[bytes]
Remove the BOM from a binary stream, yielding the content in chunks.
strip_bom_file(file_path: str, mode: str = 'r') -> Union[str, bytes]
Remove the BOM from a file's content. Use mode 'r' or 'rt' to read the file as UTF-8 text, or 'rb' to read it as bytes.
Learn More
- W3C: The byte-order mark (BOM) in HTML
- Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM
- Wikipedia: Byte Order Mark
Acknowledgments
Inspired by Sindre Sorhus's strip-bom npm package.
Changelog
See CHANGELOG.md for a detailed list of changes and version history.
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Support
If you find this library helpful:
- ⭐ Star the repository
- 🐛 Report issues
- 🔀 Submit pull requests
- 💝 Sponsor on GitHub
License
MIT © Y. Siva Sai Krishna - see LICENSE file for details.
Author's GitHub • Author's LinkedIn • Report Issues • Package on PyPI