Semantic Chunker

class SemanticChunker

Splits text into chunks based on semantic similarity, using an embedding model to detect where the content shifts.

Credit to Greg Kamradt’s notebook: 5 Levels Of Text Splitting.

Parameters:
  • embed_model (BaseEmbedding) – Embedding model used for semantic chunking.

  • buffer_size (int, optional) – Size of the buffer for semantic chunking. Default is 1.

  • breakpoint_threshold_amount (int, optional) – Threshold percentage for detecting breakpoints. Default is 95.

  • device (str, optional) – Device to use for processing. Currently supports “cpu” and “cuda”. Default is “cpu”.
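To make the roles of buffer_size and breakpoint_threshold_amount more concrete, here is a minimal standard-library sketch of percentile-based breakpoint detection in the spirit of the credited notebook. It is an illustration under stated assumptions, not this class's actual implementation: toy_embed (a bag-of-words counter standing in for embed_model), cosine_distance, percentile, and semantic_split are hypothetical names, and the way the buffer and threshold are applied here is assumed rather than taken from the library.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_split(text, buffer_size=1, breakpoint_threshold_amount=95):
    """Split text where the distance between neighbouring windows is unusually large."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    # Assumption: each sentence is embedded together with buffer_size
    # neighbours on either side, which smooths out very short sentences.
    windows = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [toy_embed(w) for w in windows]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    # Assumption: a breakpoint is any gap whose distance exceeds the given
    # percentile of all distances observed in this text.
    threshold = percentile(distances, breakpoint_threshold_amount)
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


print(semantic_split("Cats purr. Cats sleep. Stocks fell. Stocks rose."))
# prints ['Cats purr. Cats sleep.', 'Stocks fell. Stocks rose.']
```

In this sketch a higher threshold yields fewer, larger chunks, because only the most pronounced shifts between neighbouring windows count as breakpoints.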

Example

from pineflow.core.text_chunkers import SemanticChunker
from pineflow.embeddings.huggingface import HuggingFaceEmbedding

embedding = HuggingFaceEmbedding()
text_chunker = SemanticChunker(embed_model=embedding)

from_documents(documents)

Split documents into chunks.

Parameters:

documents (List[Document]) – List of Document objects to split.

Returns:

List of chunked Document objects.

Return type:

List[Document]
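The shape of this call (a list of documents in, a flat list of smaller documents out) can be sketched as follows. This is only an illustration: SimpleDocument is a hypothetical stand-in for the library's Document class, chunk_documents is a hypothetical helper, and copying the parent's metadata onto each chunk is an assumption, not confirmed behaviour.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for the library's Document class."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def chunk_documents(
    documents: List[SimpleDocument],
    split_text: Callable[[str], List[str]],
) -> List[SimpleDocument]:
    """Split every document's text and return all pieces as one flat list."""
    chunks = []
    for doc in documents:
        for piece in split_text(doc.text):
            # Assumption: each chunk carries a copy of its parent's metadata.
            chunks.append(SimpleDocument(text=piece, metadata=dict(doc.metadata)))
    return chunks


docs = [
    SimpleDocument("First part. Second part.", {"source": "a.txt"}),
    SimpleDocument("Third part.", {"source": "b.txt"}),
]
# A trivial splitter stands in for the semantic splitting step.
for chunk in chunk_documents(docs, lambda t: t.replace(". ", ".\n").split("\n")):
    print(chunk.metadata["source"], "|", chunk.text)
```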

from_text(text)

Split text into chunks.

Parameters:

text (str) – Input text to split.

Returns:

List of text chunks.

Return type:

List[str]