Semantic Chunker

class SemanticChunker

Splits text into chunks at semantic boundaries, using an embedding model to detect where the content shifts.

Credit to Greg Kamradt’s notebook: 5 Levels Of Text Splitting.

Parameters:

embed_model (BaseEmbedding) – Embedding model used for semantic chunking.
buffer_size (int, optional) – Size of the buffer for semantic chunking. Default is 1.
breakpoint_threshold_amount (int, optional) – Threshold percentage for detecting breakpoints. Default is 95.
device (str, optional) – Device to use for processing. Currently supports “cpu” and “cuda”. Default is “cpu”.
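To make the roles of buffer_size and breakpoint_threshold_amount more concrete, here is a minimal sketch of percentile-based semantic splitting in the spirit of the credited notebook. It is not pineflow's implementation. It assumes that buffer_size is the number of neighbouring sentences joined to each sentence before embedding, and that breakpoint_threshold_amount is a percentile over the distances between adjacent windows. The names toy_embed, cosine_distance, percentile, and semantic_split are hypothetical helpers, and the word-count "embedding" stands in for a real model.

```python
import math
import re
from collections import Counter
from typing import Callable, List

Vector = List[float]

# Illustrative vocabulary for the stand-in embedding below (an assumption).
VOCAB = ["cats", "purr", "sleep", "rockets", "launch", "orbit"]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: word counts over VOCAB. A real model goes here."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[word]) for word in VOCAB]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: List[float], pct: float) -> float:
    """Linearly interpolated percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_split(
    sentences: List[str],
    embed: Callable[[str], Vector],
    buffer_size: int = 1,
    breakpoint_threshold_amount: float = 95,
) -> List[str]:
    """Group sentences into chunks, breaking where adjacent windows diverge."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    # Join each sentence with up to buffer_size neighbours on each side.
    windows = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [embed(window) for window in windows]

    # Distance between each window and the next one.
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, breakpoint_threshold_amount)

    # Start a new chunk after every sentence whose distance exceeds the threshold.
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:
            chunks.append(" ".join(sentences[start: i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


sentences = [
    "Cats purr.",
    "Cats sleep.",
    "Cats purr and sleep.",
    "Rockets launch.",
    "Rockets orbit.",
]
for chunk in semantic_split(sentences, toy_embed):
    print(chunk)
```

Under these assumptions, a lower breakpoint_threshold_amount yields more, smaller chunks, and a larger buffer_size smooths the distance signal by giving each sentence more surrounding context.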

Example

from pineflow.core.text_chunkers import SemanticChunker
from pineflow.embeddings.huggingface import HuggingFaceEmbedding

embedding = HuggingFaceEmbedding()
text_chunker = SemanticChunker(embed_model=embedding)

from_documents(documents)

Split documents into chunks.

Parameters: documents (List[Document]) – List of Document objects to split.
Returns: List of chunked Document objects.
Return type: List[Document]

from_text(text)

Split text into chunks.

Parameters: text (str) – Input text to split.
Returns: List of text chunks.
Return type: List[str]
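Because from_text returns a plain List[str], its output is easy to inspect when tuning parameters. The sketch below is one possible helper for that, relying only on the from_text(text) signature documented above. TextChunker, chunk_stats, and ParagraphChunker are hypothetical names and not part of pineflow. ParagraphChunker is a stand-in that lets the example run without pineflow or an embedding model installed.

```python
from typing import Dict, List, Protocol


class TextChunker(Protocol):
    """Any object exposing from_text(text) -> List[str]."""

    def from_text(self, text: str) -> List[str]: ...


def chunk_stats(chunker: TextChunker, text: str) -> Dict[str, float]:
    """Summarise chunk count and character lengths for a given chunker."""
    chunks = chunker.from_text(text)
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(chunks),
        "min_chars": min(lengths, default=0),
        "max_chars": max(lengths, default=0),
        "mean_chars": sum(lengths) / len(lengths) if lengths else 0.0,
    }


class ParagraphChunker:
    """Hypothetical stand-in: splits on blank lines instead of semantics."""

    def from_text(self, text: str) -> List[str]:
        return [part.strip() for part in text.split("\n\n") if part.strip()]


stats = chunk_stats(ParagraphChunker(), "one two\n\nthree")
print(stats)
```

A SemanticChunker instance should be usable in place of ParagraphChunker here, since the helper only calls from_text.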