How to Replace a String in Python

2023. 12. 13. 21:26ㆍpython/basic

How to Remove or Replace a Python String or Substring

"Fake Python".replace("Fake", "Good!")

'Good! Python'

Note: Although the Python shell displays the result of .replace(), the string itself stays unchanged. You can see this more clearly if you assign the string to a variable.

name = "Fake Python"
name.replace("Fake", "Good!")

'Good! Python'

name

'Fake Python'

name = name.replace("Fake","Good!")
print(name)

Good! Python

name

'Good! Python'
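
Removing a substring is the same operation with an empty string as the replacement. The short sketch below illustrates that, along with the optional third argument of .replace(), count, which limits how many occurrences are replaced. The sample string and variable names are made up for illustration and are not part of the original examples.

```python
# Illustrative value; not taken from the original examples.
name = "Fake Fake Python"

# Replacing with an empty string removes every occurrence.
removed = name.replace("Fake ", "")

# The optional count argument limits the number of replacements.
first_only = name.replace("Fake", "Good!", 1)

print(removed)
print(first_only)
print(name)  # The original string is still unchanged.
```

Because strings are immutable, both calls return new strings, and name keeps its original value until you reassign it.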

transcript = """\
[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!"""

transcript.replace("BLASTED", "😤")

"[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?\n[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY 😤 ACCOUNT\n[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?\n[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!"

transcript.replace("BLASTED", "😤").replace("Blast", "😤")

"[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?\n[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY 😤 ACCOUNT\n[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?\n[johndoe] 2022-08-24T10:04:03+00:00 : 😤! You're right!"

Set Up Multiple Replacement Rules

# transcript_multiple_replace.py

REPLACEMENTS = [
    ("BLASTED", "😤"),
    ("Blast", "😤"),
    ("2022-08-24T", ""),
    ("+00:00", ""),
    ("[support_tom]", "Agent "),
    ("[johndoe]", "Client"),
]

transcript = """
[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!
"""

for old, new in REPLACEMENTS:
    transcript = transcript.replace(old, new)

print(transcript)

Agent  10:02:23 : What can I help you with?
Client 10:03:15 : I CAN'T CONNECT TO MY 😤 ACCOUNT
Agent  10:03:30 : Are you sure it's not your caps lock?
Client 10:04:03 : 😤! You're right!

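In this version of the transcript-cleaning script, you created a list of replacement tuples, which gives you a quick way to add replacements. If you had many replacements, you could even build this list of tuples from an external CSV file.

You then iterate over the list of replacement tuples. In each iteration, you call .replace() on the string, filling in the arguments with the old and new variables that are unpacked from each replacement tuple.

Note: In this case, unpacking in the for loop is functionally the same as using indexing: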

for replacement in REPLACEMENTS:
    transcript = transcript.replace(replacement[0], replacement[1])
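
As mentioned above, the replacement tuples could come from a CSV file. Here is a minimal sketch of that idea using the standard csv module. The column layout (an old,new header followed by one rule per row) is an assumption, and io.StringIO stands in for a real file so that the example is self-contained.

```python
import csv
import io

# Stand-in for an external file; in practice you would open a real CSV file.
# The "old,new" header and the rows below are assumed for illustration.
csv_file = io.StringIO("old,new\nBLASTED,😤\nBlast,😤\n[johndoe],Client\n")

reader = csv.reader(csv_file)
next(reader)  # Skip the header row.
replacements = [(old, new) for old, new in reader]

line = "[johndoe] Blast! MY BLASTED ACCOUNT"
for old, new in replacements:
    line = line.replace(old, new)

print(line)
```

The rules are applied in file order, just like the hard-coded list, so the order of the rows matters when one rule's output could match a later rule.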

Leverage re.sub() to Make Complex Rules

Leveraging regex in Python means using the re module's sub() function and building your own regex patterns.

# transcript_regex.py

import re

REGEX_REPLACEMENTS = [
    (r"blast\w*", "😤"),
    (r" [-T:+\d]{25}", ""),
    (r"\[support\w*\]", "Agent "),
    (r"\[johndoe\]", "Client"),
]

transcript = """
[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!
"""

for old, new in REGEX_REPLACEMENTS:
    transcript = re.sub(old, new, transcript, flags=re.IGNORECASE)

print(transcript)

Agent  : What can I help you with?
Client : I CAN'T CONNECT TO MY 😤 ACCOUNT
Agent  : Are you sure it's not your caps lock?
Client : 😤! You're right!

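You can mix the sub() function and the .replace() method, but this example uses only sub() so that you can see how it's used. You can now replace all variations of the swear word with a single replacement tuple. Likewise, you use only one regex for the full timestamp.

The first regex pattern, "blast\w*", uses the \w special character, which matches alphanumeric characters and underscores. Adding the * quantifier directly after it matches zero or more \w characters.

The second regex pattern uses a character set and a quantifier to replace the timestamp. Character sets and quantifiers are often used together. Here, [-T:+\d] matches any single hyphen, T, colon, plus sign, or digit, and {25} requires exactly 25 such characters in a row. The pattern starts with a literal space, so the space before the timestamp is removed as well.

The third regex pattern selects any user string that starts with the keyword "support". You need to escape (\) the square brackets ([ and ]). Otherwise, the keyword would be interpreted as a character set.

The last regex pattern selects the client username string and replaces it with "Client".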
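
To see what each pattern picks up on its own, you can try it with re.findall() before using it in re.sub(). The sketch below does that on a single made-up line; the sample text and variable names are illustrative only.

```python
import re

# Made-up sample line that contains several variations of the swear word.
line = "[support_tom] 2022-08-24T10:02:23+00:00 : Blast! BLASTED blasting"

# \w* extends the match over any trailing word characters.
swear_words = re.findall(r"blast\w*", line, flags=re.IGNORECASE)

# A leading space followed by exactly 25 characters from the set.
timestamps = re.findall(r" [-T:+\d]{25}", line)

# Escaped brackets match literal "[" and "]".
agents = re.findall(r"\[support\w*\]", line)

print(swear_words)
print(timestamps)
print(agents)
```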

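Use a Callback With re.sub() for Even More Control

One trick that Python and sub() have up their sleeves is that you can pass in a callback function instead of the replacement string. This gives you full control over how to match and replace.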

# transcript_regex_callback.py

import re

transcript = """
[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!
"""

def sanitize_message(match):
    print(match)

re.sub(r"[-T:+\d]{25}", sanitize_message, transcript)

<re.Match object; span=(15, 40), match='2022-08-24T10:02:23+00:00'>
<re.Match object; span=(79, 104), match='2022-08-24T10:03:15+00:00'>
<re.Match object; span=(159, 184), match='2022-08-24T10:03:30+00:00'>
<re.Match object; span=(235, 260), match='2022-08-24T10:04:03+00:00'>

"\n[support_tom]  : What can I help you with?\n[johndoe]  : I CAN'T CONNECT TO MY BLASTED ACCOUNT\n[support_tom]  : Are you sure it's not your caps lock?\n[johndoe]  : Blast! You're right!\n"

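The callback above only prints each match object and returns nothing, which is why the timestamps vanish from the result. A callback normally returns the replacement string. As a minimal sketch, the hypothetical shorten_timestamp() function below keeps only the time portion of each timestamp; the function name and the slicing approach are illustrative choices, not part of the original script.

```python
import re

line = "[johndoe] 2022-08-24T10:03:15+00:00 : Blast! You're right!"

def shorten_timestamp(match):
    # match.group() is the full matched text, e.g. "2022-08-24T10:03:15+00:00".
    timestamp = match.group()
    # Keep only the HH:MM:SS part: the 8 characters after the "T".
    return timestamp[11:19]

shortened = re.sub(r"[-T:+\d]{25}", shorten_timestamp, line)
print(shortened)
```

Whatever string the callback returns is substituted for that particular match, so each match can be replaced differently.
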
Apply the Callback to the Script

# transcript_regex_callback.py

import re

ENTRY_PATTERN = (
    r"\[(.+)\] "  # User string, discarding square brackets
    r"[-T:+\d]{25} "  # Time stamp
    r": "  # Separator
    r"(.+)"  # Message
)
BAD_WORDS = ["blast", "dash", "beezlebub"]
CLIENTS = ["johndoe", "janedoe"]

def censor_bad_words(message):
    for word in BAD_WORDS:
        message = re.sub(rf"{word}\w*", "😤", message, flags=re.IGNORECASE)
    return message

def censor_users(user):
    if user.startswith("support"):
        return "Agent"
    elif user in CLIENTS:
        return "Client"
    else:
        raise ValueError(f"unknown client: '{user}'")

def sanitize_message(match):
    user, message = match.groups()
    return f"{censor_users(user):<6} : {censor_bad_words(message)}"

transcript = """
[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!
"""

print(re.sub(ENTRY_PATTERN, sanitize_message, transcript))

Agent  : What can I help you with?
Client : I CAN'T CONNECT TO MY 😤 ACCOUNT
Agent  : Are you sure it's not your caps lock?
Client : 😤! You're right!

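\[(.+)\] matches any sequence of characters wrapped in square brackets. The capturing group picks out the username string, such as johndoe.
[-T:+\d]{25} matches the timestamp, which you explored in the last section. Since the final transcript doesn't use the timestamp, it isn't captured with parentheses.
: matches a literal colon. The colon is used as a separator between the message metadata and the message itself.
(.+) matches any sequence of characters until the end of the line, which will be the message.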
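
The pieces described above can be checked on a single transcript line. This sketch uses re.search() rather than re.sub() so that you can inspect the captured groups directly; the variable names are illustrative.

```python
import re

ENTRY_PATTERN = r"\[(.+)\] [-T:+\d]{25} : (.+)"
line = "[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!"

match = re.search(ENTRY_PATTERN, line)

# Only the two parenthesized groups are captured; the timestamp is not.
user, message = match.groups()

print(user)
print(message)
```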

Note: The entry regex definition uses Python's implicit string concatenation:

ENTRY_PATTERN = (
    r"\[(.+)\] "  # User string, discarding square brackets
    r"[-T:+\d]{25} "  # Time stamp
    r": "  # Separator
    r"(.+)"  # Message
)

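Functionally, this is the same as writing it all out as one single string: r"\[(.+)\] [-T:+\d]{25} : (.+)". Organizing longer regex patterns on separate lines lets you break them up into chunks, which not only makes them more readable but also allows you to insert comments.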
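
If you want to confirm that equivalence yourself, a quick comparison is enough. The names split_pattern and single_pattern are used here only for illustration.

```python
# Adjacent string literals inside parentheses are joined at compile time.
split_pattern = (
    r"\[(.+)\] "  # User string
    r"[-T:+\d]{25} "  # Time stamp
    r": "  # Separator
    r"(.+)"  # Message
)
single_pattern = r"\[(.+)\] [-T:+\d]{25} : (.+)"

print(split_pattern == single_pattern)
```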

The two groups are the user string and the message. The .groups() method returns them as a tuple of strings. In the sanitize_message() function, you first use unpacking to assign the two strings to variables:

def sanitize_message(match):
    user, message = match.groups()
    return f"{censor_users(user):<6} : {censor_bad_words(message)}"
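
The :<6 in the f-string is a format specifier that left-aligns the user label and pads it to six characters, which is what keeps the colons lined up in the output. A small sketch with made-up messages shows the effect:

```python
# "<6" left-aligns the value in a field that is six characters wide.
labels = [f"{label:<6} : hello" for label in ("Agent", "Client")]

for entry in labels:
    print(entry)
```

Since "Client" is already six characters long, only "Agent" receives a padding space.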

Source: https://realpython.com/replace-string-python/
